Project: Identify Customer Segments

In this project, you will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

This notebook will help you complete this task by providing a framework within which you will perform your analysis steps. In each step of the project, you will see some text describing the subtask that you will perform, followed by one or more code cells for you to complete your work. Feel free to add additional code and markdown cells as you go along so that you can explore everything in precise chunks. The code cells provided in the base template will outline only the major tasks, and will usually not be enough to cover all of the minor tasks that comprise it.

It should be noted that while there will be precise guidelines on how you should handle certain tasks in the project, there will also be places where an exact specification is not provided. There will be times in the project where you will need to make and justify your own decisions on how to treat the data. These are places where there may not be only one way to handle the data. In real-life tasks, there may be many valid ways to approach an analysis task. One of the most important things you can do is clearly document your approach so that other scientists can understand the decisions you've made.

At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, you will report your findings for the completed section, as well as document the decisions that you made in your approach to each subtask. Your project will be evaluated not just on the code used to complete the tasks outlined, but also your communication about your observations and conclusions at each stage.

In [1]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import matplotlib.ticker as mtick # to format x axis labels with thousand comma separators

# magic word for producing visualizations in notebook
%matplotlib inline

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

from sklearn.cluster import KMeans

from workspace_utils import active_session

from sklearn.metrics import silhouette_score

from sklearn.preprocessing import Imputer

import time

Step 0: Load the Data

There are four files associated with this project (not including this one):

  • Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
  • Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
  • Data_Dictionary.md: Detailed information file about the features in the provided datasets.
  • AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. You will use this information to cluster the general population into groups with similar demographic properties. Then, you will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.

To start off with, load in the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so you'll need an additional argument in your read_csv() call to read in the data properly. Also, considering the size of the main dataset, it may take some time for it to load completely.

Once the dataset is loaded, it's recommended that you take a little bit of time just browsing the general structure of the dataset and feature summary file. You'll be getting deep into the innards of the cleaning in the first major step of the project, so gaining some general familiarity can help you get your bearings.

In [2]:
# Load in the general demographics data.
azdias = pd.read_csv("Udacity_AZDIAS_Subset.csv", sep=';')

# Load in the feature summary file.
feat_info = pd.read_csv("AZDIAS_Feature_Summary.csv", sep=';')
In [3]:
# my copy to play around. 
# Resaving it cus the original is hard to open in Excel becauase of sep=';'
# feat_info.to_excel("AZDIAS_Feature_Summary_RB_Copy.xlsx") 
In [4]:
# this takes forever to run due to size of dataset
# azdias.to_excel("azdias_RB_Copy.xlsx") 

Check the structure of the data after it's loaded (e.g. print the number of rows and columns, print the first few rows).

Explore Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).

Configure how much of a dataframe the Notebook shows

https://discuss.analyticsvidhya.com/t/how-to-display-full-dataframe-in-pandas/23298

https://www.ritchieng.com/pandas-changing-display-options/

https://stackoverflow.com/questions/11707586/how-do-i-expand-the-output-display-to-see-more-columns

https://stackoverflow.com/questions/49216197/jupyter-notebook-has-become-very-slow-suddenly

be careful as high limits may slow down the notebook example from Stack Overflow

For example

pd.set_option('display.max_columns', 50000) was causing serious time issues.

I changed it to

pd.set_option('display.max_columns', 50) and problem solved.

In my personal experience,

pd.set_option('display.width', 1000) # this was causing problems

this improved Jupyter slowness pd.set_option('display.width', 50)

In [5]:
# set pandas options so I can view more of a dataframe
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 200)
In [6]:
azdias.head(10)
Out[6]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GEBURTSJAHR GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ TITEL_KZ VERS_TYP ZABEOTYP ALTER_HH ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE KK_KUNDENTYP W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_BAUMAX KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 4 0 10.0 0 -1 15.0 4.0 2.0 2.0 1.0 1.0 0 0 5.0 2 6 7 5 1 5 3 3 4 7 6 6 5 3 -1 NaN NaN -1 3 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 1 1996 10.0 0 3 21.0 6.0 5.0 3.0 2.0 1.0 1 14 1.0 5 4 4 3 1 2 2 3 6 4 7 4 7 6 3 1.0 0.0 2 5 0.0 2.0 0.0 6.0 NaN 3.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 4.0 8 8A 51 0.0 0.0 0.0 2.0 5.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 1 1979 10.0 1 3 3.0 1.0 1.0 1.0 3.0 2.0 1 15 3.0 4 1 3 3 4 4 6 3 4 7 7 7 3 3 2 0.0 0.0 1 5 17.0 1.0 0.0 4.0 NaN 3.0 9.0 10.0 0.0 1.0 5.0 1992.0 W 2.0 4 4C 24 1.0 3.0 1.0 0.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 6 1957 1.0 0 2 0.0 0.0 0.0 0.0 9.0 4.0 1 8 2.0 5 1 2 1 4 4 7 4 3 4 4 5 4 4 1 0.0 0.0 1 3 13.0 0.0 0.0 1.0 NaN NaN 9.0 1.0 0.0 1.0 4.0 1997.0 W 7.0 2 2A 12 4.0 1.0 0.0 0.0 1.0 4.0 4.0 2.0 6.0 4.0 0.0 4.0 1.0 0.0 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 5 1963 5.0 0 3 32.0 10.0 10.0 5.0 3.0 2.0 1 8 5.0 6 4 4 2 7 4 4 6 2 3 2 2 4 2 2 0.0 0.0 2 4 20.0 4.0 0.0 5.0 1.0 2.0 9.0 3.0 0.0 1.0 4.0 1992.0 W 3.0 6 6B 43 1.0 4.0 1.0 0.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0
5 3 1 2 2.0 3 1 5 2 2 5 2 1943 1.0 0 3 8.0 2.0 1.0 1.0 4.0 2.0 1 3 3.0 2 4 7 4 2 2 2 5 7 4 4 4 7 6 0 0.0 0.0 2 4 10.0 1.0 0.0 5.0 3.0 6.0 9.0 5.0 0.0 1.0 5.0 1992.0 W 7.0 8 8C 54 2.0 2.0 0.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0
6 -1 2 2 5.0 1 5 1 5 4 3 4 0 12.0 0 2 2.0 1.0 1.0 1.0 2.0 1.0 1 10 4.0 2 5 5 7 2 6 5 5 7 7 4 7 7 7 1 0.0 0.0 1 4 0.0 1.0 0.0 6.0 NaN 3.0 9.0 4.0 0.0 1.0 5.0 1992.0 W 5.0 4 4A 22 3.0 2.0 0.0 0.0 1.0 3.0 6.0 4.0 3.0 5.0 3.0 5.0 2.0 5.0 867.0 3.0 3.0 1.0 0.0 1.0 5.0 5.0 4.0 6.0 3.0
7 -1 1 1 3.0 3 3 4 1 3 2 5 1964 9.0 0 1 5.0 2.0 1.0 1.0 1.0 1.0 1 8 5.0 7 7 7 5 6 2 2 7 5 1 1 2 5 5 0 0.0 0.0 1 1 14.0 1.0 0.0 4.0 NaN 5.0 9.0 6.0 0.0 8.0 3.0 1992.0 W 1.0 2 2D 14 2.0 2.0 0.0 0.0 0.0 4.0 2.0 5.0 3.0 4.0 1.0 4.0 1.0 1.0 758.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 2.0 5.0 2.0
8 -1 3 1 3.0 4 4 2 4 2 2 6 1974 3.0 1 3 10.0 3.0 1.0 1.0 10.0 5.0 1 11 4.0 4 5 4 1 5 6 4 5 2 5 5 3 1 4 3 0.0 0.0 2 6 16.0 1.0 0.0 3.0 NaN 5.0 8.0 2.0 1.0 3.0 4.0 1992.0 W 1.0 1 1A 13 1.0 1.0 0.0 0.0 0.0 5.0 3.0 4.0 4.0 4.0 1.0 3.0 2.0 3.0 511.0 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 4.0 3.0
9 -1 3 2 4.0 2 4 2 3 5 4 1 1975 12.0 1 2 4.0 1.0 1.0 1.0 3.0 2.0 1 15 4.0 2 1 1 3 2 6 6 3 4 7 6 7 1 3 3 0.0 0.0 2 4 17.0 1.0 0.0 4.0 6.0 4.0 3.0 9.0 0.0 3.0 4.0 1992.0 W 7.0 1 1E 15 1.0 3.0 1.0 0.0 0.0 2.0 6.0 5.0 4.0 3.0 1.0 3.0 3.0 1.0 530.0 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 3.0 1.0
In [7]:
azdias.tail(10)
Out[7]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GEBURTSJAHR GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ TITEL_KZ VERS_TYP ZABEOTYP ALTER_HH ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE KK_KUNDENTYP W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_BAUMAX KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
891211 -1 3 1 2.0 3 2 4 3 3 2 2 1963 1.0 0 3 5.0 2.0 1.0 1.0 1.0 1.0 1 8 5.0 4 2 4 6 5 6 4 5 5 5 5 2 3 4 1 0.0 0.0 1 6 14.0 1.0 0.0 6.0 NaN 6.0 9.0 6.0 0.0 1.0 2.0 1992.0 W 3.0 9 9D 51 0.0 3.0 2.0 0.0 0.0 3.0 5.0 4.0 7.0 5.0 4.0 2.0 0.0 7.0 282.0 1.0 4.0 2.0 0.0 2.0 3.0 2.0 3.0 5.0 5.0
891212 -1 4 1 1.0 3 1 5 1 1 5 5 0 4.0 0 1 6.0 2.0 1.0 1.0 1.0 1.0 1 3 5.0 6 4 3 6 6 7 7 6 1 3 3 2 1 1 2 0.0 0.0 1 3 15.0 1.0 0.0 6.0 NaN 6.0 8.0 13.0 0.0 3.0 1.0 1992.0 W 3.0 9 9D 51 0.0 0.0 0.0 1.0 5.0 2.0 1.0 6.0 2.0 2.0 2.0 1.0 0.0 7.0 293.0 1.0 3.0 3.0 2.0 5.0 4.0 1.0 3.0 9.0 5.0
891213 -1 4 2 5.0 3 3 3 5 3 2 6 1966 8.0 1 1 36.0 12.0 11.0 5.0 6.0 3.0 1 11 2.0 2 2 4 5 3 5 7 2 3 4 5 6 2 3 1 0.0 0.0 1 4 21.0 9.0 0.0 5.0 1.0 1.0 9.0 4.0 0.0 3.0 4.0 1992.0 W 4.0 5 5E 34 2.0 1.0 0.0 0.0 0.0 5.0 2.0 5.0 5.0 3.0 3.0 4.0 5.0 6.0 1400.0 2.0 4.0 2.0 1.0 2.0 5.0 5.0 3.0 4.0 4.0
891214 -1 1 2 4.0 1 5 2 3 3 4 1 1978 10.0 0 3 2.0 1.0 1.0 1.0 1.0 1.0 1 14 4.0 5 4 4 3 1 2 4 2 6 4 7 6 5 6 3 0.0 0.0 2 5 17.0 1.0 0.0 5.0 NaN 4.0 9.0 6.0 0.0 1.0 2.0 1992.0 W 3.0 3 3A 23 1.0 2.0 0.0 0.0 0.0 5.0 6.0 6.0 2.0 4.0 1.0 3.0 2.0 1.0 496.0 1.0 4.0 3.0 2.0 5.0 5.0 2.0 3.0 7.0 3.0
891215 -1 2 2 6.0 1 5 2 4 5 4 1 0 12.0 0 2 2.0 1.0 1.0 1.0 2.0 1.0 2 10 1.0 2 5 5 7 2 6 5 5 7 6 7 7 6 7 1 0.0 0.0 1 4 0.0 1.0 0.0 6.0 NaN 3.0 9.0 8.0 0.0 1.0 2.0 1992.0 W 3.0 5 5A 31 0.0 3.0 3.0 0.0 3.0 2.0 2.0 5.0 5.0 3.0 4.0 2.0 2.0 4.0 976.0 2.0 4.0 2.0 1.0 2.0 4.0 4.0 2.0 5.0 2.0
891216 -1 3 2 5.0 1 4 2 5 4 4 1 1976 12.0 0 3 2.0 1.0 1.0 1.0 2.0 1.0 1 14 3.0 2 1 3 3 2 1 6 3 4 4 7 5 4 2 3 0.0 0.0 1 4 17.0 1.0 0.0 4.0 3.0 3.0 4.0 15.0 0.0 8.0 3.0 1992.0 W 3.0 7 7A 41 2.0 1.0 0.0 0.0 0.0 4.0 6.0 3.0 7.0 4.0 3.0 5.0 5.0 5.0 282.0 3.0 2.0 0.0 0.0 1.0 2.0 3.0 NaN NaN NaN
891217 -1 2 1 4.0 3 3 3 2 2 3 6 1970 1.0 0 -1 2.0 1.0 1.0 1.0 1.0 1.0 0 10 5.0 4 4 7 5 4 7 7 4 4 4 4 4 6 2 -1 0.0 0.0 -1 6 16.0 1.0 0.0 6.0 NaN 6.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 5.0 9 9D 51 0.0 0.0 1.0 1.0 5.0 2.0 7.0 6.0 2.0 3.0 3.0 1.0 2.0 7.0 592.0 1.0 3.0 3.0 2.0 4.0 5.0 3.0 4.0 6.0 5.0
891218 -1 2 2 4.0 2 4 2 5 4 3 1 1976 10.0 0 1 0.0 0.0 0.0 0.0 4.0 2.0 1 14 4.0 5 2 5 3 2 3 5 5 7 4 4 5 6 7 2 0.0 0.0 1 4 17.0 0.0 0.0 5.0 NaN NaN 5.0 3.0 0.0 8.0 6.0 1992.0 W 7.0 4 4C 24 1.0 3.0 1.0 0.0 0.0 3.0 5.0 2.0 6.0 4.0 3.0 2.0 3.0 5.0 688.0 4.0 2.0 0.0 0.0 1.0 3.0 4.0 2.0 2.0 3.0
891219 -1 1 1 3.0 1 5 3 5 5 5 1 1994 9.0 0 1 29.0 9.0 9.0 5.0 2.0 1.0 1 14 4.0 7 7 7 5 6 3 2 7 5 2 2 2 7 5 0 0.0 0.0 2 5 0.0 1.0 0.0 6.0 NaN 1.0 9.0 7.0 0.0 8.0 2.0 1992.0 W 5.0 9 9D 51 0.0 3.0 2.0 0.0 0.0 3.0 2.0 6.0 4.0 4.0 4.0 1.0 3.0 7.0 134.0 1.0 4.0 3.0 1.0 5.0 1.0 1.0 4.0 7.0 5.0
891220 -1 4 1 1.0 4 2 5 2 1 5 6 0 12.0 0 2 6.0 2.0 1.0 1.0 1.0 1.0 1 3 1.0 6 6 3 4 6 5 3 6 3 3 3 2 2 2 2 0.0 0.0 1 3 0.0 1.0 0.0 5.0 NaN 6.0 3.0 10.0 0.0 8.0 3.0 1992.0 W 4.0 6 6B 43 1.0 3.0 1.0 1.0 0.0 2.0 6.0 2.0 8.0 4.0 4.0 3.0 0.0 6.0 728.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 4.0 5.0
In [8]:
# sample dataframe
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
azdias.sample(n=20, random_state=42)
Out[8]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GEBURTSJAHR GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ TITEL_KZ VERS_TYP ZABEOTYP ALTER_HH ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE KK_KUNDENTYP W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_BAUMAX KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
848815 1 3 2 4.0 3 1 5 1 2 5 5 0 4.0 0 1 0.0 0.0 0.0 0.0 1.0 1.0 1 5 5.0 1 2 1 2 2 4 6 1 4 4 6 7 1 2 3 0.0 0.0 1 3 12.0 0.0 0.0 6.0 NaN NaN 9.0 6.0 1.0 3.0 1.0 1992.0 W 4.0 9 9B 51 0.0 0.0 2.0 1.0 0.0 1.0 1.0 6.0 1.0 1.0 3.0 1.0 1.0 6.0 646.0 1.0 4.0 3.0 2.0 5.0 5.0 2.0 3.0 9.0 5.0
299816 -1 2 2 4.0 2 5 2 5 4 2 1 1996 10.0 0 2 22.0 6.0 5.0 3.0 5.0 2.0 1 14 1.0 1 5 4 7 2 6 5 5 7 7 7 7 7 7 2 0.0 0.0 1 1 0.0 2.0 0.0 3.0 2.0 3.0 8.0 1.0 0.0 1.0 4.0 1992.0 W 3.0 4 4C 24 4.0 0.0 0.0 0.0 1.0 4.0 6.0 6.0 2.0 4.0 3.0 5.0 5.0 4.0 372.0 2.0 3.0 1.0 0.0 1.0 3.0 3.0 3.0 7.0 5.0
570748 -1 1 1 5.0 3 4 3 5 5 3 4 0 6.0 1 1 3.0 1.0 1.0 1.0 5.0 2.0 1 15 4.0 7 7 7 5 6 2 2 7 4 1 2 1 5 5 1 0.0 0.0 1 1 0.0 1.0 0.0 1.0 NaN 6.0 9.0 1.0 0.0 1.0 5.0 1994.0 W 7.0 4 4C 24 3.0 0.0 0.0 0.0 1.0 5.0 5.0 2.0 5.0 4.0 1.0 4.0 3.0 4.0 769.0 3.0 2.0 0.0 0.0 1.0 3.0 4.0 3.0 2.0 2.0
354371 -1 3 1 6.0 5 2 4 2 3 1 6 1967 10.0 0 1 37.0 12.0 10.0 5.0 9.0 4.0 1 10 3.0 3 6 4 6 7 4 4 5 5 3 2 3 4 4 1 0.0 0.0 1 1 16.0 3.0 0.0 2.0 2.0 2.0 9.0 1.0 0.0 1.0 4.0 1999.0 W 3.0 3 3A 23 NaN NaN NaN NaN NaN NaN 1.0 4.0 3.0 5.0 NaN NaN 5.0 NaN 380.0 3.0 2.0 0.0 0.0 1.0 2.0 3.0 4.0 8.0 5.0
329018 -1 3 2 6.0 3 4 3 5 5 3 4 0 5.0 0 -1 0.0 0.0 0.0 0.0 5.0 2.0 0 0 3.0 2 6 7 5 1 5 3 3 4 7 6 6 5 3 -1 NaN NaN -1 3 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
49016 1 4 1 3.0 5 1 5 1 2 2 5 1942 12.0 1 2 13.0 3.0 1.0 1.0 10.0 5.0 1 4 5.0 6 5 3 6 7 7 7 4 3 3 3 3 4 4 3 0.0 0.0 1 1 10.0 1.0 0.0 4.0 NaN 6.0 8.0 1.0 0.0 1.0 2.0 1992.0 W 3.0 9 9D 51 1.0 2.0 0.0 1.0 0.0 3.0 1.0 5.0 2.0 4.0 3.0 2.0 1.0 7.0 282.0 1.0 4.0 3.0 1.0 3.0 3.0 1.0 4.0 8.0 3.0
454415 -1 3 2 2.0 5 2 3 3 2 1 6 0 5.0 1 3 13.0 3.0 1.0 1.0 10.0 5.0 1 9 5.0 2 3 3 3 4 1 6 3 4 6 7 7 3 3 2 0.0 0.0 2 3 0.0 1.0 0.0 4.0 NaN 3.0 7.0 1.0 0.0 3.0 5.0 1992.0 W 1.0 2 2B 13 2.0 2.0 1.0 0.0 0.0 5.0 6.0 4.0 8.0 3.0 2.0 4.0 1.0 2.0 1400.0 2.0 3.0 2.0 1.0 1.0 5.0 5.0 1.0 4.0 1.0
361475 1 4 2 4.0 3 1 5 1 2 5 5 1949 10.0 0 2 6.0 2.0 1.0 1.0 1.0 1.0 1 5 3.0 5 2 1 2 2 7 7 2 3 7 6 6 1 1 3 0.0 0.0 1 3 11.0 1.0 0.0 6.0 NaN 6.0 8.0 19.0 0.0 1.0 3.0 1992.0 W 2.0 8 8D 55 0.0 0.0 0.0 2.0 4.0 1.0 1.0 6.0 4.0 4.0 3.0 1.0 2.0 6.0 416.0 3.0 2.0 0.0 1.0 1.0 3.0 3.0 4.0 9.0 4.0
148484 2 4 2 1.0 3 1 5 1 3 5 5 0 12.0 0 2 6.0 2.0 1.0 1.0 1.0 1.0 1 1 5.0 5 2 1 1 2 7 7 1 2 7 5 6 2 3 3 0.0 0.0 2 3 4.0 1.0 0.0 5.0 NaN 6.0 9.0 14.0 0.0 3.0 3.0 1992.0 W 2.0 4 4D 24 0.0 0.0 0.0 2.0 4.0 1.0 2.0 5.0 4.0 4.0 3.0 1.0 0.0 5.0 1039.0 2.0 2.0 2.0 2.0 4.0 5.0 3.0 2.0 5.0 3.0
363393 -1 2 1 4.0 3 4 2 4 4 2 4 0 12.0 0 3 4.0 1.0 1.0 1.0 3.0 2.0 1 10 1.0 7 7 6 7 6 3 1 7 4 5 4 3 7 5 1 0.0 0.0 1 4 0.0 2.0 0.0 4.0 NaN 3.0 9.0 5.0 0.0 1.0 5.0 1992.0 W 2.0 2 2B 13 2.0 1.0 0.0 0.0 0.0 5.0 6.0 2.0 7.0 4.0 3.0 4.0 2.0 7.0 570.0 4.0 2.0 0.0 0.0 1.0 3.0 4.0 4.0 5.0 5.0
476938 1 3 1 NaN 5 1 5 1 2 2 5 1950 NaN 1 3 NaN NaN NaN NaN NaN NaN 1 6 NaN 4 4 4 1 5 1 4 5 2 3 5 3 3 4 1 0.0 0.0 2 1 0.0 4.0 0.0 1.0 6.0 2.0 9.0 1.0 0.0 1.0 5.0 1992.0 W 3.0 1 1D 15 4.0 1.0 0.0 0.0 1.0 4.0 7.0 2.0 8.0 5.0 1.0 5.0 NaN 4.0 648.0 3.0 2.0 2.0 1.0 1.0 4.0 4.0 2.0 5.0 1.0
835010 -1 3 2 2.0 5 2 4 3 3 1 6 1964 10.0 1 3 20.0 5.0 2.0 2.0 10.0 5.0 1 9 2.0 4 4 3 3 4 4 6 1 4 4 7 7 4 4 2 0.0 0.0 1 2 14.0 2.0 0.0 1.0 4.0 0.0 9.0 1.0 0.0 1.0 3.0 2000.0 W 2.0 2 2D 14 1.0 3.0 0.0 0.0 0.0 3.0 2.0 5.0 4.0 4.0 0.0 3.0 3.0 0.0 1400.0 3.0 3.0 1.0 0.0 1.0 5.0 5.0 3.0 6.0 2.0
15886 1 3 2 3.0 5 2 3 3 3 1 6 1959 10.0 1 2 39.0 12.0 10.0 5.0 10.0 5.0 1 9 2.0 4 1 3 2 4 4 6 1 4 4 6 5 3 3 3 0.0 0.0 1 3 18.0 3.0 0.0 2.0 NaN 2.0 9.0 1.0 0.0 1.0 4.0 1992.0 W 7.0 3 3A 23 4.0 0.0 0.0 0.0 1.0 4.0 6.0 1.0 5.0 4.0 2.0 5.0 3.0 2.0 486.0 4.0 1.0 0.0 0.0 1.0 2.0 3.0 1.0 3.0 1.0
704805 -1 4 2 4.0 3 3 4 3 2 2 6 0 8.0 1 2 36.0 12.0 11.0 5.0 7.0 3.0 1 9 2.0 5 1 2 3 4 5 7 2 2 7 5 6 3 3 1 0.0 0.0 1 4 21.0 4.0 0.0 3.0 2.0 2.0 9.0 2.0 0.0 3.0 3.0 1992.0 W 4.0 4 4C 24 2.0 2.0 1.0 0.0 0.0 4.0 4.0 6.0 2.0 4.0 2.0 3.0 5.0 4.0 434.0 2.0 3.0 2.0 1.0 1.0 4.0 3.0 4.0 7.0 5.0
4547 -1 1 2 4.0 4 3 2 4 5 1 3 1988 9.0 1 3 35.0 11.0 10.0 5.0 10.0 5.0 1 15 1.0 3 2 4 4 1 2 4 3 6 6 7 4 6 6 0 0.0 0.0 2 1 19.0 2.0 0.0 1.0 2.0 6.0 9.0 1.0 0.0 1.0 3.0 1992.0 W 3.0 2 2A 12 4.0 0.0 0.0 0.0 1.0 5.0 6.0 3.0 5.0 5.0 2.0 5.0 4.0 2.0 413.0 4.0 1.0 1.0 0.0 1.0 3.0 3.0 2.0 4.0 3.0
440575 2 4 2 2.0 5 1 5 2 1 4 6 0 6.0 1 1 13.0 3.0 1.0 1.0 10.0 5.0 1 2 5.0 3 2 2 3 1 7 7 1 3 6 5 6 3 3 1 0.0 0.0 2 3 8.0 1.0 0.0 3.0 NaN 6.0 9.0 1.0 0.0 1.0 5.0 1992.0 W 4.0 4 4C 24 2.0 2.0 0.0 0.0 0.0 4.0 2.0 5.0 4.0 5.0 3.0 4.0 1.0 5.0 404.0 3.0 3.0 1.0 0.0 1.0 3.0 4.0 4.0 7.0 4.0
540600 2 4 2 2.0 4 1 5 1 1 3 2 1921 4.0 0 2 0.0 0.0 0.0 0.0 9.0 4.0 1 1 3.0 4 1 2 1 4 7 7 1 3 4 5 6 1 1 3 0.0 0.0 2 3 0.0 0.0 0.0 3.0 NaN NaN 9.0 2.0 0.0 1.0 2.0 1992.0 W 3.0 7 7A 41 2.0 1.0 0.0 0.0 0.0 5.0 6.0 4.0 7.0 4.0 2.0 4.0 1.0 2.0 231.0 2.0 3.0 1.0 0.0 1.0 2.0 2.0 2.0 5.0 3.0
715032 -1 4 1 4.0 4 4 1 4 4 2 4 1978 9.0 0 2 34.0 11.0 10.0 5.0 9.0 4.0 1 14 4.0 6 5 3 2 7 7 7 4 3 1 1 3 2 2 3 0.0 0.0 2 1 17.0 3.0 0.0 5.0 2.0 2.0 9.0 1.0 0.0 1.0 2.0 1992.0 O 7.0 5 5D 34 2.0 2.0 0.0 0.0 0.0 3.0 6.0 2.0 6.0 4.0 4.0 3.0 5.0 5.0 388.0 2.0 3.0 2.0 0.0 1.0 3.0 3.0 4.0 2.0 4.0
82107 -1 3 1 3.0 3 3 3 3 1 4 6 1970 12.0 0 3 2.0 1.0 1.0 1.0 1.0 1.0 1 12 3.0 6 6 4 6 5 1 4 6 5 5 2 3 5 2 1 0.0 0.0 2 6 16.0 1.0 0.0 6.0 4.0 4.0 3.0 15.0 0.0 3.0 1.0 1992.0 O 5.0 9 9B 51 0.0 0.0 0.0 2.0 4.0 1.0 1.0 6.0 3.0 3.0 3.0 1.0 5.0 2.0 323.0 1.0 4.0 3.0 2.0 5.0 3.0 2.0 4.0 8.0 4.0
197689 -1 4 1 6.0 3 2 5 1 1 4 5 0 8.0 1 2 8.0 2.0 1.0 1.0 3.0 2.0 1 6 5.0 6 6 2 4 7 5 3 4 3 3 3 2 3 4 3 0.0 0.0 1 3 0.0 0.0 0.0 6.0 NaN NaN 9.0 5.0 0.0 3.0 1.0 1990.0 W 5.0 8 8A 51 1.0 2.0 0.0 0.0 5.0 5.0 2.0 6.0 1.0 1.0 2.0 3.0 1.0 3.0 207.0 1.0 3.0 3.0 2.0 5.0 3.0 2.0 4.0 7.0 4.0
In [9]:
def explore_data(df):
    """Explore data. Print basic information and descriptive stats."""
    print("Shape: rows, cols")
    print(df.shape)
    print("")
    print()
    print("Dataframe Information:")
    print(df.info())
    print()
    print("Desciptive stats:")
    print(df.describe())    

explore_data(azdias)    
Shape: rows, cols
(891221, 85)


Dataframe Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Data columns (total 85 columns):
AGER_TYP                 891221 non-null int64
ALTERSKATEGORIE_GROB     891221 non-null int64
ANREDE_KZ                891221 non-null int64
CJT_GESAMTTYP            886367 non-null float64
FINANZ_MINIMALIST        891221 non-null int64
FINANZ_SPARER            891221 non-null int64
FINANZ_VORSORGER         891221 non-null int64
FINANZ_ANLEGER           891221 non-null int64
FINANZ_UNAUFFAELLIGER    891221 non-null int64
FINANZ_HAUSBAUER         891221 non-null int64
FINANZTYP                891221 non-null int64
GEBURTSJAHR              891221 non-null int64
GFK_URLAUBERTYP          886367 non-null float64
GREEN_AVANTGARDE         891221 non-null int64
HEALTH_TYP               891221 non-null int64
LP_LEBENSPHASE_FEIN      886367 non-null float64
LP_LEBENSPHASE_GROB      886367 non-null float64
LP_FAMILIE_FEIN          886367 non-null float64
LP_FAMILIE_GROB          886367 non-null float64
LP_STATUS_FEIN           886367 non-null float64
LP_STATUS_GROB           886367 non-null float64
NATIONALITAET_KZ         891221 non-null int64
PRAEGENDE_JUGENDJAHRE    891221 non-null int64
RETOURTYP_BK_S           886367 non-null float64
SEMIO_SOZ                891221 non-null int64
SEMIO_FAM                891221 non-null int64
SEMIO_REL                891221 non-null int64
SEMIO_MAT                891221 non-null int64
SEMIO_VERT               891221 non-null int64
SEMIO_LUST               891221 non-null int64
SEMIO_ERL                891221 non-null int64
SEMIO_KULT               891221 non-null int64
SEMIO_RAT                891221 non-null int64
SEMIO_KRIT               891221 non-null int64
SEMIO_DOM                891221 non-null int64
SEMIO_KAEM               891221 non-null int64
SEMIO_PFLICHT            891221 non-null int64
SEMIO_TRADV              891221 non-null int64
SHOPPER_TYP              891221 non-null int64
SOHO_KZ                  817722 non-null float64
TITEL_KZ                 817722 non-null float64
VERS_TYP                 891221 non-null int64
ZABEOTYP                 891221 non-null int64
ALTER_HH                 817722 non-null float64
ANZ_PERSONEN             817722 non-null float64
ANZ_TITEL                817722 non-null float64
HH_EINKOMMEN_SCORE       872873 non-null float64
KK_KUNDENTYP             306609 non-null float64
W_KEIT_KIND_HH           783619 non-null float64
WOHNDAUER_2008           817722 non-null float64
ANZ_HAUSHALTE_AKTIV      798073 non-null float64
ANZ_HH_TITEL             794213 non-null float64
GEBAEUDETYP              798073 non-null float64
KONSUMNAEHE              817252 non-null float64
MIN_GEBAEUDEJAHR         798073 non-null float64
OST_WEST_KZ              798073 non-null object
WOHNLAGE                 798073 non-null float64
CAMEO_DEUG_2015          792242 non-null object
CAMEO_DEU_2015           792242 non-null object
CAMEO_INTL_2015          792242 non-null object
KBA05_ANTG1              757897 non-null float64
KBA05_ANTG2              757897 non-null float64
KBA05_ANTG3              757897 non-null float64
KBA05_ANTG4              757897 non-null float64
KBA05_BAUMAX             757897 non-null float64
KBA05_GBZ                757897 non-null float64
BALLRAUM                 797481 non-null float64
EWDICHTE                 797481 non-null float64
INNENSTADT               797481 non-null float64
GEBAEUDETYP_RASTER       798066 non-null float64
KKK                      770025 non-null float64
MOBI_REGIO               757897 non-null float64
ONLINE_AFFINITAET        886367 non-null float64
REGIOTYP                 770025 non-null float64
KBA13_ANZAHL_PKW         785421 non-null float64
PLZ8_ANTG1               774706 non-null float64
PLZ8_ANTG2               774706 non-null float64
PLZ8_ANTG3               774706 non-null float64
PLZ8_ANTG4               774706 non-null float64
PLZ8_BAUMAX              774706 non-null float64
PLZ8_HHZ                 774706 non-null float64
PLZ8_GBZ                 774706 non-null float64
ARBEIT                   794005 non-null float64
ORTSGR_KLS9              794005 non-null float64
RELAT_AB                 794005 non-null float64
dtypes: float64(49), int64(32), object(4)
memory usage: 578.0+ MB
None

Desciptive stats:
            AGER_TYP  ALTERSKATEGORIE_GROB      ANREDE_KZ  CJT_GESAMTTYP  \
count  891221.000000         891221.000000  891221.000000  886367.000000   
mean       -0.358435              2.777398       1.522098       3.632838   
std         1.198724              1.068775       0.499512       1.595021   
min        -1.000000              1.000000       1.000000       1.000000   
25%        -1.000000              2.000000       1.000000       2.000000   
50%        -1.000000              3.000000       2.000000       4.000000   
75%        -1.000000              4.000000       2.000000       5.000000   
max         3.000000              9.000000       2.000000       6.000000   

       FINANZ_MINIMALIST  FINANZ_SPARER  FINANZ_VORSORGER  FINANZ_ANLEGER  \
count      891221.000000  891221.000000     891221.000000   891221.000000   
mean            3.074528       2.821039          3.401106        3.033328   
std             1.321055       1.464749          1.322134        1.529603   
min             1.000000       1.000000          1.000000        1.000000   
25%             2.000000       1.000000          3.000000        2.000000   
50%             3.000000       3.000000          3.000000        3.000000   
75%             4.000000       4.000000          5.000000        5.000000   
max             5.000000       5.000000          5.000000        5.000000   

       FINANZ_UNAUFFAELLIGER  FINANZ_HAUSBAUER      FINANZTYP    GEBURTSJAHR  \
count          891221.000000     891221.000000  891221.000000  891221.000000   
mean                2.874167          3.075121       3.790586    1101.178533   
std                 1.486731          1.353248       1.987876     976.583551   
min                 1.000000          1.000000       1.000000       0.000000   
25%                 2.000000          2.000000       2.000000       0.000000   
50%                 3.000000          3.000000       4.000000    1943.000000   
75%                 4.000000          4.000000       6.000000    1970.000000   
max                 5.000000          5.000000       6.000000    2017.000000   

       GFK_URLAUBERTYP  GREEN_AVANTGARDE     HEALTH_TYP  LP_LEBENSPHASE_FEIN  \
count    886367.000000     891221.000000  891221.000000        886367.000000   
mean          7.350304          0.196612       1.792102            14.622637   
std           3.525723          0.397437       1.269062            12.616883   
min           1.000000          0.000000      -1.000000             0.000000   
25%           5.000000          0.000000       1.000000             4.000000   
50%           8.000000          0.000000       2.000000            11.000000   
75%          10.000000          0.000000       3.000000            27.000000   
max          12.000000          1.000000       3.000000            40.000000   

       LP_LEBENSPHASE_GROB  LP_FAMILIE_FEIN  LP_FAMILIE_GROB  LP_STATUS_FEIN  \
count        886367.000000    886367.000000    886367.000000   886367.000000   
mean              4.453621         3.599574         2.185966        4.791151   
std               3.855639         3.926486         1.756537        3.425305   
min               0.000000         0.000000         0.000000        1.000000   
25%               1.000000         1.000000         1.000000        2.000000   
50%               3.000000         1.000000         1.000000        4.000000   
75%               8.000000         8.000000         4.000000        9.000000   
max              12.000000        11.000000         5.000000       10.000000   

       LP_STATUS_GROB  NATIONALITAET_KZ  PRAEGENDE_JUGENDJAHRE  \
count   886367.000000     891221.000000          891221.000000   
mean         2.432575          1.026827               8.154346   
std          1.474315          0.586634               4.844532   
min          1.000000          0.000000               0.000000   
25%          1.000000          1.000000               5.000000   
50%          2.000000          1.000000               8.000000   
75%          4.000000          1.000000              14.000000   
max          5.000000          3.000000              15.000000   

       RETOURTYP_BK_S      SEMIO_SOZ      SEMIO_FAM      SEMIO_REL  \
count   886367.000000  891221.000000  891221.000000  891221.000000   
mean         3.419630       3.945860       4.272729       4.240609   
std          1.417741       1.946564       1.915885       2.007373   
min          1.000000       1.000000       1.000000       1.000000   
25%          2.000000       2.000000       3.000000       3.000000   
50%          3.000000       4.000000       4.000000       4.000000   
75%          5.000000       6.000000       6.000000       6.000000   
max          5.000000       7.000000       7.000000       7.000000   

           SEMIO_MAT     SEMIO_VERT     SEMIO_LUST      SEMIO_ERL  \
count  891221.000000  891221.000000  891221.000000  891221.000000   
mean        4.001597       4.023709       4.359086       4.481405   
std         1.857540       2.077746       2.022829       1.807552   
min         1.000000       1.000000       1.000000       1.000000   
25%         2.000000       2.000000       2.000000       3.000000   
50%         4.000000       4.000000       5.000000       4.000000   
75%         5.000000       6.000000       6.000000       6.000000   
max         7.000000       7.000000       7.000000       7.000000   

          SEMIO_KULT      SEMIO_RAT     SEMIO_KRIT      SEMIO_DOM  \
count  891221.000000  891221.000000  891221.000000  891221.000000   
mean        4.025014       3.910139       4.763223       4.667550   
std         1.903816       1.580306       1.830789       1.795712   
min         1.000000       1.000000       1.000000       1.000000   
25%         3.000000       3.000000       3.000000       3.000000   
50%         4.000000       4.000000       5.000000       5.000000   
75%         5.000000       5.000000       6.000000       6.000000   
max         7.000000       7.000000       7.000000       7.000000   

          SEMIO_KAEM  SEMIO_PFLICHT    SEMIO_TRADV    SHOPPER_TYP  \
count  891221.000000  891221.000000  891221.000000  891221.000000   
mean        4.445007       4.256076       3.661784       1.266967   
std         1.852412       1.770137       1.707637       1.287435   
min         1.000000       1.000000       1.000000      -1.000000   
25%         3.000000       3.000000       2.000000       0.000000   
50%         5.000000       4.000000       3.000000       1.000000   
75%         6.000000       6.000000       5.000000       2.000000   
max         7.000000       7.000000       7.000000       3.000000   

             SOHO_KZ       TITEL_KZ       VERS_TYP       ZABEOTYP  \
count  817722.000000  817722.000000  891221.000000  891221.000000   
mean        0.008423       0.003483       1.197852       3.362438   
std         0.091392       0.084957       0.952532       1.352704   
min         0.000000       0.000000      -1.000000       1.000000   
25%         0.000000       0.000000       1.000000       3.000000   
50%         0.000000       0.000000       1.000000       3.000000   
75%         0.000000       0.000000       2.000000       4.000000   
max         1.000000       5.000000       2.000000       6.000000   

            ALTER_HH   ANZ_PERSONEN      ANZ_TITEL  HH_EINKOMMEN_SCORE  \
count  817722.000000  817722.000000  817722.000000       872873.000000   
mean       10.864126       1.727637       0.004162            4.207243   
std         7.639683       1.155849       0.068855            1.624057   
min         0.000000       0.000000       0.000000            1.000000   
25%         0.000000       1.000000       0.000000            3.000000   
50%        13.000000       1.000000       0.000000            5.000000   
75%        17.000000       2.000000       0.000000            6.000000   
max        21.000000      45.000000       6.000000            6.000000   

        KK_KUNDENTYP  W_KEIT_KIND_HH  WOHNDAUER_2008  ANZ_HAUSHALTE_AKTIV  \
count  306609.000000   783619.000000   817722.000000        798073.000000   
mean        3.410640        3.933406        7.908791             8.287263   
std         1.628844        1.964701        1.923137            15.628087   
min         1.000000        0.000000        1.000000             0.000000   
25%         2.000000        2.000000        8.000000             1.000000   
50%         3.000000        4.000000        9.000000             4.000000   
75%         5.000000        6.000000        9.000000             9.000000   
max         6.000000        6.000000        9.000000           595.000000   

        ANZ_HH_TITEL    GEBAEUDETYP    KONSUMNAEHE  MIN_GEBAEUDEJAHR  \
count  794213.000000  798073.000000  817252.000000     798073.000000   
mean        0.040647       2.798641       3.018452       1993.277011   
std         0.324028       2.656713       1.550312          3.332739   
min         0.000000       1.000000       1.000000       1985.000000   
25%         0.000000       1.000000       2.000000       1992.000000   
50%         0.000000       1.000000       3.000000       1992.000000   
75%         0.000000       3.000000       4.000000       1993.000000   
max        23.000000       8.000000       7.000000       2016.000000   

            WOHNLAGE    KBA05_ANTG1    KBA05_ANTG2    KBA05_ANTG3  \
count  798073.000000  757897.000000  757897.000000  757897.000000   
mean        4.052836       1.494277       1.265584       0.624525   
std         1.949539       1.403961       1.245178       1.013443   
min         0.000000       0.000000       0.000000       0.000000   
25%         3.000000       0.000000       0.000000       0.000000   
50%         3.000000       1.000000       1.000000       0.000000   
75%         5.000000       3.000000       2.000000       1.000000   
max         8.000000       4.000000       4.000000       3.000000   

         KBA05_ANTG4   KBA05_BAUMAX      KBA05_GBZ       BALLRAUM  \
count  757897.000000  757897.000000  757897.000000  797481.000000   
mean        0.305927       1.389552       3.158580       4.153043   
std         0.638725       1.779483       1.329537       2.183710   
min         0.000000       0.000000       1.000000       1.000000   
25%         0.000000       0.000000       2.000000       2.000000   
50%         0.000000       1.000000       3.000000       5.000000   
75%         0.000000       3.000000       4.000000       6.000000   
max         2.000000       5.000000       5.000000       7.000000   

            EWDICHTE     INNENSTADT  GEBAEUDETYP_RASTER            KKK  \
count  797481.000000  797481.000000       798066.000000  770025.000000   
mean        3.939172       4.549491            3.738306       2.592991   
std         1.718996       2.028919            0.923193       1.119052   
min         1.000000       1.000000            1.000000       0.000000   
25%         2.000000       3.000000            3.000000       2.000000   
50%         4.000000       5.000000            4.000000       3.000000   
75%         6.000000       6.000000            4.000000       3.000000   
max         6.000000       8.000000            5.000000       4.000000   

          MOBI_REGIO  ONLINE_AFFINITAET       REGIOTYP  KBA13_ANZAHL_PKW  \
count  757897.000000      886367.000000  770025.000000     785421.000000   
mean        2.963540           2.698691       4.257967        619.701439   
std         1.428882           1.521524       2.030385        340.034318   
min         1.000000           0.000000       0.000000          0.000000   
25%         2.000000           1.000000       3.000000        384.000000   
50%         3.000000           3.000000       5.000000        549.000000   
75%         4.000000           4.000000       6.000000        778.000000   
max         6.000000           5.000000       7.000000       2300.000000   

          PLZ8_ANTG1     PLZ8_ANTG2     PLZ8_ANTG3     PLZ8_ANTG4  \
count  774706.000000  774706.000000  774706.000000  774706.000000   
mean        2.253330       2.801858       1.595426       0.699166   
std         0.972008       0.920309       0.986736       0.727137   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       2.000000       1.000000       0.000000   
50%         2.000000       3.000000       2.000000       1.000000   
75%         3.000000       3.000000       2.000000       1.000000   
max         4.000000       4.000000       3.000000       2.000000   

         PLZ8_BAUMAX       PLZ8_HHZ       PLZ8_GBZ         ARBEIT  \
count  774706.000000  774706.000000  774706.000000  794005.000000   
mean        1.943913       3.612821       3.381087       3.167854   
std         1.459654       0.973967       1.111598       1.002376   
min         1.000000       1.000000       1.000000       1.000000   
25%         1.000000       3.000000       3.000000       3.000000   
50%         1.000000       4.000000       3.000000       3.000000   
75%         3.000000       4.000000       4.000000       4.000000   
max         5.000000       5.000000       5.000000       9.000000   

         ORTSGR_KLS9      RELAT_AB  
count  794005.000000  794005.00000  
mean        5.293002       3.07222  
std         2.303739       1.36298  
min         0.000000       1.00000  
25%         4.000000       2.00000  
50%         5.000000       3.00000  
75%         7.000000       4.00000  
max         9.000000       9.00000  
In [10]:
azdias.describe().transpose()
Out[10]:
count mean std min 25% 50% 75% max
AGER_TYP 891221.0 -0.358435 1.198724 -1.0 -1.0 -1.0 -1.0 3.0
ALTERSKATEGORIE_GROB 891221.0 2.777398 1.068775 1.0 2.0 3.0 4.0 9.0
ANREDE_KZ 891221.0 1.522098 0.499512 1.0 1.0 2.0 2.0 2.0
CJT_GESAMTTYP 886367.0 3.632838 1.595021 1.0 2.0 4.0 5.0 6.0
FINANZ_MINIMALIST 891221.0 3.074528 1.321055 1.0 2.0 3.0 4.0 5.0
FINANZ_SPARER 891221.0 2.821039 1.464749 1.0 1.0 3.0 4.0 5.0
FINANZ_VORSORGER 891221.0 3.401106 1.322134 1.0 3.0 3.0 5.0 5.0
FINANZ_ANLEGER 891221.0 3.033328 1.529603 1.0 2.0 3.0 5.0 5.0
FINANZ_UNAUFFAELLIGER 891221.0 2.874167 1.486731 1.0 2.0 3.0 4.0 5.0
FINANZ_HAUSBAUER 891221.0 3.075121 1.353248 1.0 2.0 3.0 4.0 5.0
FINANZTYP 891221.0 3.790586 1.987876 1.0 2.0 4.0 6.0 6.0
GEBURTSJAHR 891221.0 1101.178533 976.583551 0.0 0.0 1943.0 1970.0 2017.0
GFK_URLAUBERTYP 886367.0 7.350304 3.525723 1.0 5.0 8.0 10.0 12.0
GREEN_AVANTGARDE 891221.0 0.196612 0.397437 0.0 0.0 0.0 0.0 1.0
HEALTH_TYP 891221.0 1.792102 1.269062 -1.0 1.0 2.0 3.0 3.0
LP_LEBENSPHASE_FEIN 886367.0 14.622637 12.616883 0.0 4.0 11.0 27.0 40.0
LP_LEBENSPHASE_GROB 886367.0 4.453621 3.855639 0.0 1.0 3.0 8.0 12.0
LP_FAMILIE_FEIN 886367.0 3.599574 3.926486 0.0 1.0 1.0 8.0 11.0
LP_FAMILIE_GROB 886367.0 2.185966 1.756537 0.0 1.0 1.0 4.0 5.0
LP_STATUS_FEIN 886367.0 4.791151 3.425305 1.0 2.0 4.0 9.0 10.0
LP_STATUS_GROB 886367.0 2.432575 1.474315 1.0 1.0 2.0 4.0 5.0
NATIONALITAET_KZ 891221.0 1.026827 0.586634 0.0 1.0 1.0 1.0 3.0
PRAEGENDE_JUGENDJAHRE 891221.0 8.154346 4.844532 0.0 5.0 8.0 14.0 15.0
RETOURTYP_BK_S 886367.0 3.419630 1.417741 1.0 2.0 3.0 5.0 5.0
SEMIO_SOZ 891221.0 3.945860 1.946564 1.0 2.0 4.0 6.0 7.0
SEMIO_FAM 891221.0 4.272729 1.915885 1.0 3.0 4.0 6.0 7.0
SEMIO_REL 891221.0 4.240609 2.007373 1.0 3.0 4.0 6.0 7.0
SEMIO_MAT 891221.0 4.001597 1.857540 1.0 2.0 4.0 5.0 7.0
SEMIO_VERT 891221.0 4.023709 2.077746 1.0 2.0 4.0 6.0 7.0
SEMIO_LUST 891221.0 4.359086 2.022829 1.0 2.0 5.0 6.0 7.0
SEMIO_ERL 891221.0 4.481405 1.807552 1.0 3.0 4.0 6.0 7.0
SEMIO_KULT 891221.0 4.025014 1.903816 1.0 3.0 4.0 5.0 7.0
SEMIO_RAT 891221.0 3.910139 1.580306 1.0 3.0 4.0 5.0 7.0
SEMIO_KRIT 891221.0 4.763223 1.830789 1.0 3.0 5.0 6.0 7.0
SEMIO_DOM 891221.0 4.667550 1.795712 1.0 3.0 5.0 6.0 7.0
SEMIO_KAEM 891221.0 4.445007 1.852412 1.0 3.0 5.0 6.0 7.0
SEMIO_PFLICHT 891221.0 4.256076 1.770137 1.0 3.0 4.0 6.0 7.0
SEMIO_TRADV 891221.0 3.661784 1.707637 1.0 2.0 3.0 5.0 7.0
SHOPPER_TYP 891221.0 1.266967 1.287435 -1.0 0.0 1.0 2.0 3.0
SOHO_KZ 817722.0 0.008423 0.091392 0.0 0.0 0.0 0.0 1.0
TITEL_KZ 817722.0 0.003483 0.084957 0.0 0.0 0.0 0.0 5.0
VERS_TYP 891221.0 1.197852 0.952532 -1.0 1.0 1.0 2.0 2.0
ZABEOTYP 891221.0 3.362438 1.352704 1.0 3.0 3.0 4.0 6.0
ALTER_HH 817722.0 10.864126 7.639683 0.0 0.0 13.0 17.0 21.0
ANZ_PERSONEN 817722.0 1.727637 1.155849 0.0 1.0 1.0 2.0 45.0
ANZ_TITEL 817722.0 0.004162 0.068855 0.0 0.0 0.0 0.0 6.0
HH_EINKOMMEN_SCORE 872873.0 4.207243 1.624057 1.0 3.0 5.0 6.0 6.0
KK_KUNDENTYP 306609.0 3.410640 1.628844 1.0 2.0 3.0 5.0 6.0
W_KEIT_KIND_HH 783619.0 3.933406 1.964701 0.0 2.0 4.0 6.0 6.0
WOHNDAUER_2008 817722.0 7.908791 1.923137 1.0 8.0 9.0 9.0 9.0
ANZ_HAUSHALTE_AKTIV 798073.0 8.287263 15.628087 0.0 1.0 4.0 9.0 595.0
ANZ_HH_TITEL 794213.0 0.040647 0.324028 0.0 0.0 0.0 0.0 23.0
GEBAEUDETYP 798073.0 2.798641 2.656713 1.0 1.0 1.0 3.0 8.0
KONSUMNAEHE 817252.0 3.018452 1.550312 1.0 2.0 3.0 4.0 7.0
MIN_GEBAEUDEJAHR 798073.0 1993.277011 3.332739 1985.0 1992.0 1992.0 1993.0 2016.0
WOHNLAGE 798073.0 4.052836 1.949539 0.0 3.0 3.0 5.0 8.0
KBA05_ANTG1 757897.0 1.494277 1.403961 0.0 0.0 1.0 3.0 4.0
KBA05_ANTG2 757897.0 1.265584 1.245178 0.0 0.0 1.0 2.0 4.0
KBA05_ANTG3 757897.0 0.624525 1.013443 0.0 0.0 0.0 1.0 3.0
KBA05_ANTG4 757897.0 0.305927 0.638725 0.0 0.0 0.0 0.0 2.0
KBA05_BAUMAX 757897.0 1.389552 1.779483 0.0 0.0 1.0 3.0 5.0
KBA05_GBZ 757897.0 3.158580 1.329537 1.0 2.0 3.0 4.0 5.0
BALLRAUM 797481.0 4.153043 2.183710 1.0 2.0 5.0 6.0 7.0
EWDICHTE 797481.0 3.939172 1.718996 1.0 2.0 4.0 6.0 6.0
INNENSTADT 797481.0 4.549491 2.028919 1.0 3.0 5.0 6.0 8.0
GEBAEUDETYP_RASTER 798066.0 3.738306 0.923193 1.0 3.0 4.0 4.0 5.0
KKK 770025.0 2.592991 1.119052 0.0 2.0 3.0 3.0 4.0
MOBI_REGIO 757897.0 2.963540 1.428882 1.0 2.0 3.0 4.0 6.0
ONLINE_AFFINITAET 886367.0 2.698691 1.521524 0.0 1.0 3.0 4.0 5.0
REGIOTYP 770025.0 4.257967 2.030385 0.0 3.0 5.0 6.0 7.0
KBA13_ANZAHL_PKW 785421.0 619.701439 340.034318 0.0 384.0 549.0 778.0 2300.0
PLZ8_ANTG1 774706.0 2.253330 0.972008 0.0 1.0 2.0 3.0 4.0
PLZ8_ANTG2 774706.0 2.801858 0.920309 0.0 2.0 3.0 3.0 4.0
PLZ8_ANTG3 774706.0 1.595426 0.986736 0.0 1.0 2.0 2.0 3.0
PLZ8_ANTG4 774706.0 0.699166 0.727137 0.0 0.0 1.0 1.0 2.0
PLZ8_BAUMAX 774706.0 1.943913 1.459654 1.0 1.0 1.0 3.0 5.0
PLZ8_HHZ 774706.0 3.612821 0.973967 1.0 3.0 4.0 4.0 5.0
PLZ8_GBZ 774706.0 3.381087 1.111598 1.0 3.0 3.0 4.0 5.0
ARBEIT 794005.0 3.167854 1.002376 1.0 3.0 3.0 4.0 9.0
ORTSGR_KLS9 794005.0 5.293002 2.303739 0.0 4.0 5.0 7.0 9.0
RELAT_AB 794005.0 3.072220 1.362980 1.0 2.0 3.0 4.0 9.0
In [11]:
def value_counts_in_each_column(df):
    """Give value counts for each column
    to see mode and how many categories there are
    """
    for column in df.columns.tolist():
        print(column)
        print(df[column].value_counts())
        print()
In [12]:
value_counts_in_each_column(azdias)
AGER_TYP
-1    677503
 2     98472
 1     79802
 3     27104
 0      8340
Name: AGER_TYP, dtype: int64

ALTERSKATEGORIE_GROB
3    358533
4    228510
2    158410
1    142887
9      2881
Name: ALTERSKATEGORIE_GROB, dtype: int64

ANREDE_KZ
2    465305
1    425916
Name: ANREDE_KZ, dtype: int64

CJT_GESAMTTYP
4.0    210963
3.0    156449
6.0    153915
2.0    148795
5.0    117376
1.0     98869
Name: CJT_GESAMTTYP, dtype: int64

FINANZ_MINIMALIST
3    256276
5    168863
4    167182
2    159313
1    139587
Name: FINANZ_MINIMALIST, dtype: int64

FINANZ_SPARER
1    250213
4    201223
2    153051
5    146380
3    140354
Name: FINANZ_SPARER, dtype: int64

FINANZ_VORSORGER
5    242262
3    229842
4    198218
2    116530
1    104369
Name: FINANZ_VORSORGER, dtype: int64

FINANZ_ANLEGER
5    234508
1    210812
2    161286
4    143597
3    141018
Name: FINANZ_ANLEGER, dtype: int64

FINANZ_UNAUFFAELLIGER
1    220597
5    200551
2    185749
3    170628
4    113696
Name: FINANZ_UNAUFFAELLIGER, dtype: int64

FINANZ_HAUSBAUER
3    235184
5    183918
2    171847
4    157168
1    143104
Name: FINANZ_HAUSBAUER, dtype: int64

FINANZTYP
6    290367
1    199572
4    130625
2    110867
5    106436
3     53354
Name: FINANZTYP, dtype: int64

GEBURTSJAHR
0       392318
1967     11183
1965     11090
1966     10933
1970     10883
1964     10799
1968     10792
1963     10513
1969     10360
1980     10275
1962     10082
1961      9880
1971      9786
1982      9516
1978      9509
1960      9492
1979      9422
1981      9374
1977      9296
1959      9098
1972      9027
1976      9005
1983      8887
1974      8676
1984      8553
1975      8480
1973      8356
1958      8323
1986      8192
1985      8180
1957      8099
1956      8039
1955      7828
1988      7801
1987      7767
1954      7533
1989      7251
1952      7106
1953      7096
1950      7071
1990      6848
1951      6832
1949      6657
1941      6235
1948      5833
1991      5741
1944      5493
1947      5475
1943      5442
1942      5222
1992      5200
1946      4808
1993      4635
1940      4561
1994      4249
1939      4226
1945      4113
1996      4047
1997      4026
1995      4009
1938      3862
1937      3369
1936      3074
1935      2951
1934      2526
1933      1862
1932      1696
1930      1625
1931      1535
1929      1275
1928      1180
1927      1021
1926       865
1998       830
2012       806
2002       782
2000       753
2001       748
1999       744
1925       728
2003       707
2004       679
1924       603
2017       593
2005       592
2006       570
2007       567
2009       559
2008       550
2010       545
2011       485
1923       468
2013       380
1922       375
1921       355
2015       257
1920       238
1919       194
2016       167
2014       124
1918        85
1914        55
1917        55
1916        45
1910        41
1913        39
1915        37
1911        30
1912        28
1905         8
1908         7
1906         7
1909         7
1904         5
1907         4
1900         4
1902         1
Name: GEBURTSJAHR, dtype: int64

GFK_URLAUBERTYP
12.0    138545
5.0     120126
10.0    109127
8.0      88042
11.0     79740
4.0      63770
9.0      60614
3.0      56007
1.0      53600
2.0      46702
7.0      42956
6.0      27138
Name: GFK_URLAUBERTYP, dtype: int64

GREEN_AVANTGARDE
0    715996
1    175225
Name: GREEN_AVANTGARDE, dtype: int64

HEALTH_TYP
 3    310693
 2    306944
 1    162388
-1    111196
Name: HEALTH_TYP, dtype: int64

LP_LEBENSPHASE_FEIN
0.0     92778
1.0     62667
5.0     55542
6.0     45614
2.0     39434
8.0     30475
11.0    26710
29.0    26577
7.0     26508
13.0    26085
10.0    25789
31.0    23987
12.0    23300
30.0    22361
15.0    20062
3.0     19985
19.0    19484
37.0    18525
4.0     17595
14.0    17529
20.0    17132
32.0    17105
39.0    16182
40.0    15150
27.0    14475
16.0    14466
38.0    13914
35.0    13679
34.0    13074
9.0     13066
21.0    12766
28.0    12264
24.0    12091
36.0    10505
25.0    10370
23.0     9191
22.0     7224
18.0     7168
33.0     6066
17.0     5888
26.0     3584
Name: LP_LEBENSPHASE_FEIN, dtype: int64

LP_LEBENSPHASE_GROB
2.0     158139
1.0     139681
3.0     115624
0.0      89718
12.0     74276
4.0      54443
5.0      49672
9.0      48938
10.0     41092
11.0     32819
8.0      30323
6.0      29181
7.0      22461
Name: LP_LEBENSPHASE_GROB, dtype: int64

LP_FAMILIE_FEIN
1.0     426379
10.0    137913
2.0     104305
0.0      72938
11.0     51719
8.0      23032
7.0      20730
4.0      12303
5.0      11920
9.0      11148
6.0       9022
3.0       4958
Name: LP_FAMILIE_FEIN, dtype: int64

LP_FAMILIE_GROB
1.0    426379
5.0    200780
2.0    104305
0.0     72938
4.0     52784
3.0     29181
Name: LP_FAMILIE_GROB, dtype: int64

LP_STATUS_FEIN
1.0     219275
9.0     143238
2.0     118236
10.0    118022
4.0      78317
5.0      74493
3.0      74105
6.0      30914
8.0      19708
7.0      10059
Name: LP_STATUS_FEIN, dtype: int64

LP_STATUS_GROB
1.0    337511
2.0    226915
4.0    162946
5.0    118022
3.0     40973
Name: LP_STATUS_GROB, dtype: int64

NATIONALITAET_KZ
1    684085
0    108315
2     65418
3     33403
Name: NATIONALITAET_KZ, dtype: int64

PRAEGENDE_JUGENDJAHRE
14    188697
8     145988
0     108164
5      86416
10     85808
3      55195
15     42547
11     35752
9      33570
6      25652
12     24446
1      21282
4      20451
2       7479
13      5764
7       4010
Name: PRAEGENDE_JUGENDJAHRE, dtype: int64

RETOURTYP_BK_S
5.0    297993
3.0    231816
4.0    131115
1.0    129712
2.0     95731
Name: RETOURTYP_BK_S, dtype: int64

SEMIO_SOZ
2    244714
6    136205
5    121786
3    118889
7    117378
4     90161
1     62088
Name: SEMIO_SOZ, dtype: int64

SEMIO_FAM
6    186729
2    139562
4    135942
5    133740
7    118517
3     94815
1     81916
Name: SEMIO_FAM, dtype: int64

SEMIO_REL
7    211377
4    207128
3    150801
1    108130
5     79566
2     73127
6     61092
Name: SEMIO_REL, dtype: int64

SEMIO_MAT
5    171267
4    162862
2    134549
3    123701
7    111976
1     97341
6     89525
Name: SEMIO_MAT, dtype: int64

SEMIO_VERT
2    204333
6    141714
5    135205
7    134756
4    122982
1    120437
3     31794
Name: SEMIO_VERT, dtype: int64

SEMIO_LUST
5    170040
6    158624
7    158234
2    114373
1    110382
4     97495
3     82073
Name: SEMIO_LUST, dtype: int64

SEMIO_ERL
4    196206
3    180824
7    179141
6    139209
2     77012
5     76133
1     42696
Name: SEMIO_ERL, dtype: int64

SEMIO_KULT
3    209067
5    176282
1    128216
7    117378
4    101502
6    101286
2     57490
Name: SEMIO_KULT, dtype: int64

SEMIO_RAT
4    334456
2    140433
3    131994
5     89056
7     87024
6     61484
1     46774
Name: SEMIO_RAT, dtype: int64

SEMIO_KRIT
7    219847
5    156298
4    144079
6    133049
3    129106
1     54947
2     53895
Name: SEMIO_KRIT, dtype: int64

SEMIO_DOM
6    183435
5    177889
7    161495
4    125115
2    101498
3     97027
1     44762
Name: SEMIO_DOM, dtype: int64

SEMIO_KAEM
6    206001
3    180955
7    135579
5    128501
2    114038
4     78944
1     47203
Name: SEMIO_KAEM, dtype: int64

SEMIO_PFLICHT
5    203845
4    162117
3    133990
7    115458
6    109442
2     92214
1     74155
Name: SEMIO_PFLICHT, dtype: int64

SEMIO_TRADV
3    226571
4    174203
2    132657
5    117378
1     96775
7     76133
6     67504
Name: SEMIO_TRADV, dtype: int64

SHOPPER_TYP
 1    254761
 2    207463
 3    190219
 0    127582
-1    111196
Name: SHOPPER_TYP, dtype: int64

SOHO_KZ
0.0    810834
1.0      6888
Name: SOHO_KZ, dtype: int64

TITEL_KZ
0.0    815562
1.0      1947
5.0       104
4.0        57
3.0        49
2.0         3
Name: TITEL_KZ, dtype: int64

VERS_TYP
 2    398722
 1    381303
-1    111196
Name: VERS_TYP, dtype: int64

ZABEOTYP
3    364905
4    210095
1    123622
5     84956
6     74473
2     33170
Name: ZABEOTYP, dtype: int64

ALTER_HH
0.0     236768
18.0     60852
17.0     55665
19.0     52890
15.0     51867
16.0     51857
14.0     44275
21.0     41610
20.0     40671
13.0     37612
12.0     34923
10.0     30419
11.0     27924
9.0      22817
8.0      13463
7.0       8419
6.0       3809
5.0       1030
4.0        603
3.0        200
2.0         47
1.0          1
Name: ALTER_HH, dtype: int64

ANZ_PERSONEN
1.0     423383
2.0     195470
3.0      94905
4.0      47126
0.0      34103
5.0      15503
6.0       4842
7.0       1525
8.0        523
9.0        180
10.0        67
11.0        38
12.0        16
13.0        11
21.0         4
14.0         4
20.0         3
15.0         3
38.0         2
23.0         2
37.0         2
22.0         2
35.0         1
17.0         1
16.0         1
45.0         1
18.0         1
40.0         1
29.0         1
31.0         1
Name: ANZ_PERSONEN, dtype: int64

ANZ_TITEL
0.0    814542
1.0      2970
2.0       202
3.0         5
4.0         2
6.0         1
Name: ANZ_TITEL, dtype: int64

HH_EINKOMMEN_SCORE
6.0    252775
5.0    201482
2.0    140817
4.0    139762
3.0     84805
1.0     53232
Name: HH_EINKOMMEN_SCORE, dtype: int64

KK_KUNDENTYP
3.0    65151
2.0    62564
5.0    48038
4.0    44512
6.0    44114
1.0    42230
Name: KK_KUNDENTYP, dtype: int64

W_KEIT_KIND_HH
6.0    281966
4.0    128675
3.0    100170
2.0     84000
1.0     83706
5.0     64716
0.0     40386
Name: W_KEIT_KIND_HH, dtype: int64

WOHNDAUER_2008
9.0    551176
8.0     80118
4.0     50736
3.0     38767
6.0     35170
5.0     30959
7.0     23939
2.0      6174
1.0       683
Name: WOHNDAUER_2008, dtype: int64

ANZ_HAUSHALTE_AKTIV
1.0      195957
2.0      120982
3.0       62575
4.0       43213
5.0       37815
6.0       36020
7.0       34526
8.0       32293
9.0       29002
10.0      25428
11.0      21965
12.0      18033
13.0      15282
14.0      12625
15.0      10371
16.0       8899
17.0       7292
0.0        6463
18.0       6324
19.0       5461
20.0       4674
21.0       4138
22.0       3735
23.0       3243
24.0       2838
25.0       2636
26.0       2342
27.0       2232
28.0       2040
29.0       1963
30.0       1821
31.0       1634
32.0       1616
33.0       1551
34.0       1437
35.0       1320
37.0       1317
36.0       1281
38.0       1255
40.0       1115
39.0       1113
42.0       1043
41.0       1031
44.0        829
43.0        819
46.0        771
45.0        686
47.0        665
48.0        662
49.0        609
50.0        548
52.0        538
51.0        474
55.0        436
54.0        436
53.0        427
56.0        399
58.0        394
57.0        376
61.0        371
59.0        346
64.0        312
62.0        308
60.0        295
63.0        291
67.0        248
70.0        248
68.0        246
73.0        245
66.0        244
72.0        226
65.0        201
69.0        192
71.0        188
77.0        182
75.0        171
74.0        171
80.0        150
76.0        149
82.0        138
79.0        136
85.0        131
84.0        131
81.0        130
78.0        128
91.0        123
83.0        123
86.0        122
87.0        119
89.0        117
93.0        116
92.0        114
88.0        107
90.0        101
97.0         99
102.0        96
95.0         91
98.0         91
103.0        90
99.0         86
          ...  
348.0        11
163.0        11
265.0        11
236.0        10
184.0        10
218.0        10
190.0        10
258.0        10
194.0        10
277.0         9
172.0         9
367.0         9
314.0         9
438.0         9
189.0         9
209.0         9
151.0         9
274.0         9
221.0         9
219.0         9
344.0         9
321.0         8
595.0         8
177.0         8
333.0         8
169.0         8
200.0         8
207.0         8
185.0         8
318.0         8
252.0         8
346.0         7
242.0         7
267.0         7
353.0         7
276.0         7
328.0         7
286.0         7
211.0         7
206.0         7
241.0         7
208.0         7
201.0         7
445.0         7
191.0         7
317.0         6
304.0         6
202.0         6
215.0         6
379.0         6
259.0         6
193.0         6
430.0         6
283.0         6
214.0         6
290.0         6
166.0         6
204.0         6
243.0         6
377.0         5
205.0         5
231.0         5
197.0         5
280.0         5
263.0         5
229.0         5
240.0         5
228.0         5
316.0         5
326.0         4
285.0         4
515.0         4
523.0         4
301.0         4
249.0         4
174.0         4
266.0         4
256.0         4
255.0         4
250.0         4
260.0         4
331.0         4
226.0         3
224.0         3
168.0         3
307.0         3
414.0         3
244.0         3
378.0         3
293.0         3
272.0         3
395.0         3
237.0         2
254.0         2
404.0         2
213.0         2
366.0         1
536.0         1
232.0         1
220.0         1
Name: ANZ_HAUSHALTE_AKTIV, Length: 292, dtype: int64

ANZ_HH_TITEL
0.0     770244
1.0      20157
2.0       2459
3.0        585
4.0        232
5.0        117
6.0        106
8.0         68
7.0         65
9.0         34
13.0        29
12.0        22
11.0        22
14.0        16
10.0        16
17.0        13
20.0         9
15.0         7
18.0         6
16.0         3
23.0         3
Name: ANZ_HH_TITEL, dtype: int64

GEBAEUDETYP
1.0    460465
3.0    178668
8.0    152476
2.0      4935
4.0       900
6.0       628
5.0         1
Name: GEBAEUDETYP, dtype: int64

KONSUMNAEHE
1.0    193738
3.0    171127
5.0    153535
2.0    134665
4.0    133324
6.0     26625
7.0      4238
Name: KONSUMNAEHE, dtype: int64

MIN_GEBAEUDEJAHR
1992.0    568776
1994.0     78835
1993.0     25488
1995.0     25464
1996.0     16611
1997.0     14464
2000.0      7382
2001.0      5877
1991.0      5811
2005.0      5553
1999.0      4413
1990.0      4408
2002.0      4216
1998.0      4097
2003.0      3356
2004.0      2935
2008.0      2197
2007.0      2156
1989.0      2046
2009.0      2016
2006.0      1984
2011.0      1903
2012.0      1861
2010.0      1410
2013.0      1230
1988.0      1027
2014.0      1001
2015.0       717
1987.0       470
2016.0       128
1986.0       125
1985.0       116
Name: MIN_GEBAEUDEJAHR, dtype: int64

OST_WEST_KZ
W    629528
O    168545
Name: OST_WEST_KZ, dtype: int64

WOHNLAGE
3.0    249719
7.0    169318
4.0    135973
2.0    100376
5.0     74346
1.0     43918
8.0     17473
0.0      6950
Name: WOHNLAGE, dtype: int64

CAMEO_DEUG_2015
8    134441
9    108177
6    105874
4    103912
3     86779
2     83231
7     77933
5     55310
1     36212
X       373
Name: CAMEO_DEUG_2015, dtype: int64

CAMEO_DEU_2015
6B    56672
8A    52438
4C    47819
2D    35074
3C    34769
7A    34399
3D    34307
8B    33434
4A    33155
8C    30993
9D    28593
9B    27676
9C    24987
7B    24503
9A    20542
2C    19422
8D    17576
6E    16107
2B    15486
5D    14943
6C    14820
2A    13249
5A    12214
1D    11909
1A    10850
3A    10543
5B    10354
5C     9935
7C     9065
4B     9047
4D     8570
3B     7160
6A     6810
9E     6379
6D     6073
6F     5392
7D     5333
4E     5321
1E     5065
7E     4633
1C     4317
5F     4283
1B     4071
5E     3581
XX      373
Name: CAMEO_DEU_2015, dtype: int64

CAMEO_INTL_2015
51    133694
41     92336
24     91158
14     62884
43     56672
54     45391
25     39628
22     33155
23     26750
13     26336
45     26132
55     23955
52     20542
31     19024
34     18524
15     16974
44     14820
12     13249
35     10356
32     10354
33      9935
XX       373
Name: CAMEO_INTL_2015, dtype: int64

KBA05_ANTG1
0.0    261049
1.0    161224
2.0    126725
3.0    117762
4.0     91137
Name: KBA05_ANTG1, dtype: int64

KBA05_ANTG2
0.0    292538
1.0    163751
2.0    138273
3.0    134455
4.0     28880
Name: KBA05_ANTG2, dtype: int64

KBA05_ANTG3
0.0    511545
1.0     92748
2.0     80234
3.0     73370
Name: KBA05_ANTG3, dtype: int64

KBA05_ANTG4
0.0    600171
1.0     83591
2.0     74135
Name: KBA05_ANTG4, dtype: int64

KBA05_BAUMAX
0.0    343200
1.0    208417
5.0     98923
3.0     59955
4.0     37718
2.0      9684
Name: KBA05_BAUMAX, dtype: int64

KBA05_GBZ
3.0    197833
5.0    158971
4.0    155301
2.0    138528
1.0    107264
Name: KBA05_GBZ, dtype: int64

BALLRAUM
6.0    255093
1.0    151782
2.0    104521
7.0     99039
3.0     73277
4.0     61358
5.0     52411
Name: BALLRAUM, dtype: int64

EWDICHTE
6.0    201009
5.0    161209
2.0    139087
4.0    130716
1.0     84051
3.0     81409
Name: EWDICHTE, dtype: int64

INNENSTADT
5.0    147626
4.0    134067
6.0    111679
2.0    109048
3.0     92818
8.0     82870
7.0     67463
1.0     51910
Name: INNENSTADT, dtype: int64

GEBAEUDETYP_RASTER
4.0    359620
3.0    205330
5.0    159217
2.0     58961
1.0     14938
Name: GEBAEUDETYP_RASTER, dtype: int64

KKK
3.0    273024
2.0    181519
4.0    178648
1.0     99966
0.0     36868
Name: KKK, dtype: int64

MOBI_REGIO
1.0    163993
3.0    150336
5.0    148713
4.0    148209
2.0    146305
6.0       341
Name: MOBI_REGIO, dtype: int64

ONLINE_AFFINITAET
2.0    197850
4.0    164704
3.0    163487
1.0    156499
5.0    138111
0.0     65716
Name: ONLINE_AFFINITAET, dtype: int64

REGIOTYP
6.0    195286
5.0    145359
3.0     93929
2.0     91662
7.0     83943
4.0     68180
1.0     54798
0.0     36868
Name: REGIOTYP, dtype: int64

KBA13_ANZAHL_PKW
1400.0    11722
1500.0     8291
1300.0     6427
1600.0     6135
1700.0     3795
1800.0     2617
464.0      1604
417.0      1604
519.0      1600
534.0      1496
386.0      1458
1900.0     1450
395.0      1446
481.0      1417
455.0      1409
483.0      1393
452.0      1388
418.0      1384
454.0      1380
450.0      1380
494.0      1379
459.0      1379
492.0      1359
504.0      1340
387.0      1338
420.0      1337
439.0      1327
506.0      1326
388.0      1324
456.0      1323
487.0      1319
402.0      1318
421.0      1317
499.0      1310
466.0      1308
491.0      1302
490.0      1302
558.0      1301
477.0      1298
489.0      1296
567.0      1292
536.0      1290
406.0      1288
516.0      1284
393.0      1282
453.0      1282
389.0      1281
390.0      1280
556.0      1278
584.0      1278
438.0      1276
574.0      1274
485.0      1273
537.0      1267
517.0      1266
479.0      1264
508.0      1262
497.0      1262
377.0      1257
478.0      1256
500.0      1255
352.0      1254
572.0      1254
375.0      1254
467.0      1254
446.0      1253
409.0      1252
451.0      1245
429.0      1245
384.0      1243
515.0      1242
426.0      1241
434.0      1240
518.0      1239
470.0      1232
554.0      1232
488.0      1232
471.0      1229
369.0      1228
382.0      1225
360.0      1224
442.0      1223
410.0      1221
502.0      1220
430.0      1220
399.0      1218
396.0      1217
345.0      1215
509.0      1214
428.0      1213
565.0      1212
380.0      1207
475.0      1205
597.0      1203
549.0      1201
412.0      1201
469.0      1200
530.0      1200
457.0      1200
437.0      1198
          ...  
1168.0      113
1177.0      112
76.0        111
1198.0      107
88.0        107
1074.0      103
1221.0      103
69.0        101
71.0         99
1133.0       98
100.0        97
64.0         95
1224.0       95
80.0         94
70.0         91
73.0         89
74.0         89
1233.0       87
1122.0       85
1213.0       85
75.0         85
51.0         84
78.0         84
68.0         84
63.0         80
1093.0       78
47.0         77
59.0         76
1247.0       76
52.0         75
1184.0       75
77.0         74
72.0         74
1115.0       73
1185.0       71
66.0         69
1098.0       67
65.0         67
62.0         66
67.0         66
0.0          62
1225.0       61
58.0         60
45.0         59
61.0         58
1232.0       57
60.0         54
56.0         53
55.0         53
53.0         53
48.0         53
44.0         52
54.0         50
46.0         46
57.0         46
37.0         45
41.0         44
35.0         43
34.0         42
40.0         42
36.0         38
50.0         37
42.0         35
38.0         35
33.0         31
31.0         29
39.0         28
32.0         28
49.0         27
43.0         27
28.0         24
27.0         24
25.0         23
24.0         22
26.0         21
18.0         21
17.0         20
20.0         18
21.0         17
22.0         16
12.0         16
14.0         16
29.0         15
15.0         14
23.0         13
30.0         12
16.0         11
19.0         11
13.0         10
1.0           8
10.0          8
11.0          7
5.0           7
9.0           7
4.0           7
3.0           6
8.0           6
2.0           6
7.0           5
6.0           5
Name: KBA13_ANZAHL_PKW, Length: 1261, dtype: int64

PLZ8_ANTG1
2.0    270590
3.0    222355
1.0    189247
4.0     87044
0.0      5470
Name: PLZ8_ANTG1, dtype: int64

PLZ8_ANTG2
3.0    307283
2.0    215767
4.0    191005
1.0     53213
0.0      7438
Name: PLZ8_ANTG2, dtype: int64

PLZ8_ANTG3
2.0    252994
1.0    237878
3.0    164040
0.0    119794
Name: PLZ8_ANTG3, dtype: int64

PLZ8_ANTG4
0.0    356389
1.0    294986
2.0    123331
Name: PLZ8_ANTG4, dtype: int64

PLZ8_BAUMAX
1.0    499550
5.0     97333
2.0     70407
4.0     56684
3.0     50732
Name: PLZ8_BAUMAX, dtype: int64

PLZ8_HHZ
3.0    309146
4.0    211911
5.0    175813
2.0     66891
1.0     10945
Name: PLZ8_HHZ, dtype: int64

PLZ8_GBZ
3.0    288383
4.0    180252
5.0    153883
2.0    111588
1.0     40600
Name: PLZ8_GBZ, dtype: int64

ARBEIT
4.0    311339
3.0    254988
2.0    135662
1.0     56767
5.0     35090
9.0       159
Name: ARBEIT, dtype: int64

ORTSGR_KLS9
5.0    148096
4.0    114909
7.0    102866
9.0     91879
3.0     83542
6.0     75995
8.0     72709
2.0     63362
1.0     40589
0.0        58
Name: ORTSGR_KLS9, dtype: int64

RELAT_AB
3.0    274008
5.0    174964
1.0    142907
2.0    104846
4.0     97121
9.0       159
Name: RELAT_AB, dtype: int64

In [13]:
def bar_graph_for_each_column(df,column_list='all'):
    """Loop thru all columns in the dataset. Give frequencies of values in columns.
    
    df - you data
    
    column_list = which columns do you want to look at (pass a list). Default = "all"

    Y axis - unique values in column
    X axis - frequency of that value

    https://etav.github.io/python/count_basic_freq_plot.html
    https://stackoverflow.com/questions/38152356/matplotlib-dollar-sign-with-thousands-comma-tick-labels/38152510
    https://stackoverflow.com/questions/50169311/getting-percentage-and-count-python
    """
    
    if column_list == 'all':
        column_list = df.columns.tolist()
    
    for column in column_list:
        fig, ax = plt.subplots(1, 1,figsize=(10, 5))
        df[column].value_counts().plot('barh').invert_yaxis() 

        plt.title(column)

        ax.set_ylabel('Value')
        ax.set_xlabel('Frequency')

        fmt = '{x:,.0f}'
        tick = mtick.StrMethodFormatter(fmt)
        ax.xaxis.set_major_formatter(tick) 

        plt.show()  
           
#         print(df[column].value_counts())
        
        summary = pd.concat([df[column].value_counts(), df[column].value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))
        print(column)
        print(summary)
                             
bar_graph_for_each_column(azdias,column_list=['AGER_TYP', 'RELAT_AB'])                              
AGER_TYP
    counts  percentage
-1  677503   76.019640
 2   98472   11.049111
 1   79802    8.954232
 3   27104    3.041221
 0    8340    0.935795
RELAT_AB
     counts  percentage
3.0  274008   34.509606
5.0  174964   22.035629
1.0  142907   17.998249
2.0  104846   13.204703
4.0   97121   12.231787
9.0     159    0.020025
In [14]:
# bar_graph_for_each_column?
In [15]:
bar_graph_for_each_column(azdias)  
AGER_TYP
    counts  percentage
-1  677503   76.019640
 2   98472   11.049111
 1   79802    8.954232
 3   27104    3.041221
 0    8340    0.935795
ALTERSKATEGORIE_GROB
   counts  percentage
3  358533   40.229416
4  228510   25.640105
2  158410   17.774491
1  142887   16.032724
9    2881    0.323264
ANREDE_KZ
   counts  percentage
2  465305   52.209833
1  425916   47.790167
CJT_GESAMTTYP
     counts  percentage
4.0  210963   23.800864
3.0  156449   17.650589
6.0  153915   17.364703
2.0  148795   16.787065
5.0  117376   13.242370
1.0   98869   11.154409
FINANZ_MINIMALIST
   counts  percentage
3  256276   28.755606
5  168863   18.947377
4  167182   18.758759
2  159313   17.875813
1  139587   15.662445
FINANZ_SPARER
   counts  percentage
1  250213   28.075303
4  201223   22.578350
2  153051   17.173182
5  146380   16.424658
3  140354   15.748507
FINANZ_VORSORGER
   counts  percentage
5  242262   27.183157
3  229842   25.789563
4  198218   22.241173
2  116530   13.075320
1  104369   11.710788
FINANZ_ANLEGER
   counts  percentage
5  234508   26.313114
1  210812   23.654290
2  161286   18.097195
4  143597   16.112390
3  141018   15.823011
FINANZ_UNAUFFAELLIGER
   counts  percentage
1  220597   24.752222
5  200551   22.502948
2  185749   20.842081
3  170628   19.145420
4  113696   12.757330
FINANZ_HAUSBAUER
   counts  percentage
3  235184   26.388965
5  183918   20.636632
2  171847   19.282198
4  157168   17.635132
1  143104   16.057072
FINANZTYP
   counts  percentage
6  290367   32.580808
1  199572   22.393099
4  130625   14.656858
2  110867   12.439900
5  106436   11.942717
3   53354    5.986618
GEBURTSJAHR
      counts  percentage
0     392318   44.020282
1967   11183    1.254795
1965   11090    1.244360
1966   10933    1.226744
1970   10883    1.221134
1964   10799    1.211708
1968   10792    1.210923
1963   10513    1.179618
1969   10360    1.162450
1980   10275    1.152913
1962   10082    1.131257
1961    9880    1.108591
1971    9786    1.098044
1982    9516    1.067749
1978    9509    1.066963
1960    9492    1.065056
1979    9422    1.057201
1981    9374    1.051815
1977    9296    1.043063
1959    9098    1.020847
1972    9027    1.012880
1976    9005    1.010412
1983    8887    0.997171
1974    8676    0.973496
1984    8553    0.959695
1975    8480    0.951504
1973    8356    0.937590
1958    8323    0.933887
1986    8192    0.919188
1985    8180    0.917842
1957    8099    0.908753
1956    8039    0.902021
1955    7828    0.878346
1988    7801    0.875316
1987    7767    0.871501
1954    7533    0.845245
1989    7251    0.813603
1952    7106    0.797333
1953    7096    0.796211
1950    7071    0.793406
1990    6848    0.768384
1951    6832    0.766589
1949    6657    0.746953
1941    6235    0.699602
1948    5833    0.654495
1991    5741    0.644172
1944    5493    0.616345
1947    5475    0.614326
1943    5442    0.610623
1942    5222    0.585938
1992    5200    0.583469
1946    4808    0.539485
1993    4635    0.520073
1940    4561    0.511770
1994    4249    0.476762
1939    4226    0.474181
1945    4113    0.461502
1996    4047    0.454096
1997    4026    0.451740
1995    4009    0.449832
1938    3862    0.433338
1937    3369    0.378021
1936    3074    0.344920
1935    2951    0.331119
1934    2526    0.283431
1933    1862    0.208927
1932    1696    0.190301
1930    1625    0.182334
1931    1535    0.172236
1929    1275    0.143062
1928    1180    0.132403
1927    1021    0.114562
1926     865    0.097058
1998     830    0.093131
2012     806    0.090438
2002     782    0.087745
2000     753    0.084491
2001     748    0.083930
1999     744    0.083481
1925     728    0.081686
2003     707    0.079329
2004     679    0.076188
1924     603    0.067660
2017     593    0.066538
2005     592    0.066426
2006     570    0.063957
2007     567    0.063621
2009     559    0.062723
2008     550    0.061713
2010     545    0.061152
2011     485    0.054420
1923     468    0.052512
2013     380    0.042638
1922     375    0.042077
1921     355    0.039833
2015     257    0.028837
1920     238    0.026705
1919     194    0.021768
2016     167    0.018738
2014     124    0.013913
1918      85    0.009537
1914      55    0.006171
1917      55    0.006171
1916      45    0.005049
1910      41    0.004600
1913      39    0.004376
1915      37    0.004152
1911      30    0.003366
1912      28    0.003142
1905       8    0.000898
1908       7    0.000785
1906       7    0.000785
1909       7    0.000785
1904       5    0.000561
1907       4    0.000449
1900       4    0.000449
1902       1    0.000112
GFK_URLAUBERTYP
      counts  percentage
12.0  138545   15.630659
5.0   120126   13.552625
10.0  109127   12.311717
8.0    88042    9.932906
11.0   79740    8.996274
4.0    63770    7.194537
9.0    60614    6.838477
3.0    56007    6.318714
1.0    53600    6.047157
2.0    46702    5.268924
7.0    42956    4.846300
6.0    27138    3.061711
GREEN_AVANTGARDE
   counts  percentage
0  715996   80.338771
1  175225   19.661229
HEALTH_TYP
    counts  percentage
 3  310693   34.861499
 2  306944   34.440840
 1  162388   18.220845
-1  111196   12.476816
LP_LEBENSPHASE_FEIN
      counts  percentage
0.0    92778   10.467222
1.0    62667    7.070096
5.0    55542    6.266253
6.0    45614    5.146175
2.0    39434    4.448947
8.0    30475    3.438192
11.0   26710    3.013424
29.0   26577    2.998419
7.0    26508    2.990635
13.0   26085    2.942912
10.0   25789    2.909517
31.0   23987    2.706215
12.0   23300    2.628708
30.0   22361    2.522770
15.0   20062    2.263397
3.0    19985    2.254709
19.0   19484    2.198187
37.0   18525    2.089992
4.0    17595    1.985069
14.0   17529    1.977623
20.0   17132    1.932834
32.0   17105    1.929788
39.0   16182    1.825655
40.0   15150    1.709224
27.0   14475    1.633071
16.0   14466    1.632055
38.0   13914    1.569779
35.0   13679    1.543266
34.0   13074    1.475010
9.0    13066    1.474107
21.0   12766    1.440261
28.0   12264    1.383626
24.0   12091    1.364108
36.0   10505    1.185175
25.0   10370    1.169944
23.0    9191    1.036929
22.0    7224    0.815012
18.0    7168    0.808694
33.0    6066    0.684367
17.0    5888    0.664285
26.0    3584    0.404347
LP_LEBENSPHASE_GROB
      counts  percentage
2.0   158139   17.841255
1.0   139681   15.758822
3.0   115624   13.044709
0.0    89718   10.121992
12.0   74276    8.379825
4.0    54443    6.142264
5.0    49672    5.603999
9.0    48938    5.521189
10.0   41092    4.636003
11.0   32819    3.702642
8.0    30323    3.421043
6.0    29181    3.292203
7.0    22461    2.534052
LP_FAMILIE_FEIN
      counts  percentage
1.0   426379   48.104115
10.0  137913   15.559356
2.0   104305   11.767699
0.0    72938    8.228871
11.0   51719    5.834942
8.0    23032    2.598472
7.0    20730    2.338760
4.0    12303    1.388026
5.0    11920    1.344815
9.0    11148    1.257718
6.0     9022    1.017863
3.0     4958    0.559362
LP_FAMILIE_GROB
     counts  percentage
1.0  426379   48.104115
5.0  200780   22.652017
2.0  104305   11.767699
0.0   72938    8.228871
4.0   52784    5.955095
3.0   29181    3.292203
LP_STATUS_FEIN
      counts  percentage
1.0   219275   24.738624
9.0   143238   16.160123
2.0   118236   13.339396
10.0  118022   13.315252
4.0    78317    8.835731
5.0    74493    8.404307
3.0    74105    8.360532
6.0    30914    3.487720
8.0    19708    2.223458
7.0    10059    1.134857
LP_STATUS_GROB
     counts  percentage
1.0  337511   38.078020
2.0  226915   25.600570
4.0  162946   18.383582
5.0  118022   13.315252
3.0   40973    4.622577
NATIONALITAET_KZ
   counts  percentage
1  684085   76.758178
0  108315   12.153551
2   65418    7.340267
3   33403    3.748004
PRAEGENDE_JUGENDJAHRE
    counts  percentage
14  188697   21.172863
8   145988   16.380673
0   108164   12.136608
5    86416    9.696360
10   85808    9.628139
3    55195    6.193189
15   42547    4.774012
11   35752    4.011575
9    33570    3.766742
6    25652    2.878298
12   24446    2.742978
1    21282    2.387960
4    20451    2.294717
2     7479    0.839186
13    5764    0.646753
7     4010    0.449945
RETOURTYP_BK_S
     counts  percentage
5.0  297993   33.619595
3.0  231816   26.153501
4.0  131115   14.792405
1.0  129712   14.634119
2.0   95731   10.800380
SEMIO_SOZ
   counts  percentage
2  244714   27.458285
6  136205   15.282966
5  121786   13.665073
3  118889   13.340013
7  117378   13.170471
4   90161   10.116570
1   62088    6.966622
SEMIO_FAM
   counts  percentage
6  186729   20.952042
2  139562   15.659640
4  135942   15.253456
5  133740   15.006379
7  118517   13.298273
3   94815   10.638775
1   81916    9.191435
SEMIO_REL
   counts  percentage
7  211377   23.717686
4  207128   23.240925
3  150801   16.920719
1  108130   12.132793
5   79566    8.927752
2   73127    8.205260
6   61092    6.854865
SEMIO_MAT
   counts  percentage
5  171267   19.217119
4  162862   18.274031
2  134549   15.097153
3  123701   13.879947
7  111976   12.564336
1   97341   10.922207
6   89525   10.045208
SEMIO_VERT
   counts  percentage
2  204333   22.927310
6  141714   15.901106
5  135205   15.170760
7  134756   15.120380
4  122982   13.799271
1  120437   13.513708
3   31794    3.567465
SEMIO_LUST
   counts  percentage
5  170040   19.079443
6  158624   17.798503
7  158234   17.754743
2  114373   12.833293
1  110382   12.385480
4   97495   10.939486
3   82073    9.209051
SEMIO_ERL
   counts  percentage
4  196206   22.015415
3  180824   20.289468
7  179141   20.100626
6  139209   15.620031
2   77012    8.641179
5   76133    8.542550
1   42696    4.790731
SEMIO_KULT
   counts  percentage
3  209067   23.458491
5  176282   19.779830
1  128216   14.386555
7  117378   13.170471
4  101502   11.389094
6  101286   11.364858
2   57490    6.450701
SEMIO_RAT
   counts  percentage
4  334456   37.527841
2  140433   15.757371
3  131994   14.810468
5   89056    9.992583
7   87024    9.764581
6   61484    6.898850
1   46774    5.248305
SEMIO_KRIT
   counts  percentage
7  219847   24.668068
5  156298   17.537513
4  144079   16.166473
6  133049   14.928845
3  129106   14.486418
1   54947    6.165362
2   53895    6.047322
SEMIO_DOM
   counts  percentage
6  183435   20.582437
5  177889   19.960145
7  161495   18.120646
4  125115   14.038605
2  101498   11.388645
3   97027   10.886974
1   44762    5.022548
SEMIO_KAEM
   counts  percentage
6  206001   23.114469
3  180955   20.304167
7  135579   15.212725
5  128501   14.418534
2  114038   12.795704
4   78944    8.857960
1   47203    5.296442
SEMIO_PFLICHT
   counts  percentage
5  203845   22.872553
4  162117   18.190438
3  133990   15.034430
7  115458   12.955036
6  109442   12.280007
2   92214   10.346929
1   74155    8.320607
SEMIO_TRADV
   counts  percentage
3  226571   25.422538
4  174203   19.546555
2  132657   14.884860
5  117378   13.170471
1   96775   10.858698
7   76133    8.542550
6   67504    7.574328
SHOPPER_TYP
    counts  percentage
 1  254761   28.585615
 2  207463   23.278513
 3  190219   21.343640
 0  127582   14.315417
-1  111196   12.476816
SOHO_KZ
     counts  percentage
0.0  810834    99.15766
1.0    6888     0.84234
TITEL_KZ
     counts  percentage
0.0  815562   99.735852
1.0    1947    0.238100
5.0     104    0.012718
4.0      57    0.006971
3.0      49    0.005992
2.0       3    0.000367
VERS_TYP
    counts  percentage
 2  398722   44.738847
 1  381303   42.784337
-1  111196   12.476816
ZABEOTYP
   counts  percentage
3  364905   40.944390
4  210095   23.573839
1  123622   13.871082
5   84956    9.532540
6   74473    8.356289
2   33170    3.721860
ALTER_HH
      counts  percentage
0.0   236768   28.954584
18.0   60852    7.441649
17.0   55665    6.807326
19.0   52890    6.467968
15.0   51867    6.342865
16.0   51857    6.341642
14.0   44275    5.414432
21.0   41610    5.088526
20.0   40671    4.973695
13.0   37612    4.599607
12.0   34923    4.270767
10.0   30419    3.719968
11.0   27924    3.414852
9.0    22817    2.790313
8.0    13463    1.646403
7.0     8419    1.029568
6.0     3809    0.465806
5.0     1030    0.125960
4.0      603    0.073741
3.0      200    0.024458
2.0       47    0.005748
1.0        1    0.000122
ANZ_PERSONEN
      counts  percentage
1.0   423383   51.775909
2.0   195470   23.904212
3.0    94905   11.606023
4.0    47126    5.763083
0.0    34103    4.170488
5.0    15503    1.895877
6.0     4842    0.592133
7.0     1525    0.186494
8.0      523    0.063958
9.0      180    0.022012
10.0      67    0.008193
11.0      38    0.004647
12.0      16    0.001957
13.0      11    0.001345
21.0       4    0.000489
14.0       4    0.000489
20.0       3    0.000367
15.0       3    0.000367
38.0       2    0.000245
23.0       2    0.000245
37.0       2    0.000245
22.0       2    0.000245
35.0       1    0.000122
17.0       1    0.000122
16.0       1    0.000122
45.0       1    0.000122
18.0       1    0.000122
40.0       1    0.000122
29.0       1    0.000122
31.0       1    0.000122
ANZ_TITEL
     counts  percentage
0.0  814542   99.611115
1.0    2970    0.363204
2.0     202    0.024703
3.0       5    0.000611
4.0       2    0.000245
6.0       1    0.000122
HH_EINKOMMEN_SCORE
     counts  percentage
6.0  252775   28.958967
5.0  201482   23.082625
2.0  140817   16.132587
4.0  139762   16.011722
3.0   84805    9.715617
1.0   53232    6.098482
KK_KUNDENTYP
     counts  percentage
3.0   65151   21.248887
2.0   62564   20.405141
5.0   48038   15.667511
4.0   44512   14.517513
6.0   44114   14.387706
1.0   42230   13.773242
W_KEIT_KIND_HH
     counts  percentage
6.0  281966   35.982537
4.0  128675   16.420607
3.0  100170   12.782998
2.0   84000   10.719495
1.0   83706   10.681977
5.0   64716    8.258605
0.0   40386    5.153780
WOHNDAUER_2008
     counts  percentage
9.0  551176   67.403837
8.0   80118    9.797706
4.0   50736    6.204554
3.0   38767    4.740853
6.0   35170    4.300973
5.0   30959    3.786006
7.0   23939    2.927523
2.0    6174    0.755024
1.0     683    0.083525
ANZ_HAUSHALTE_AKTIV
       counts  percentage
1.0    195957   24.553769
2.0    120982   15.159265
3.0     62575    7.840761
4.0     43213    5.414668
5.0     37815    4.738288
6.0     36020    4.513372
7.0     34526    4.326171
8.0     32293    4.046372
9.0     29002    3.634003
10.0    25428    3.186175
11.0    21965    2.752254
12.0    18033    2.259568
13.0    15282    1.914862
14.0    12625    1.581935
15.0    10371    1.299505
16.0     8899    1.115061
17.0     7292    0.913701
0.0      6463    0.809826
18.0     6324    0.792409
19.0     5461    0.684273
20.0     4674    0.585661
21.0     4138    0.518499
22.0     3735    0.468002
23.0     3243    0.406354
24.0     2838    0.355607
25.0     2636    0.330296
26.0     2342    0.293457
27.0     2232    0.279674
28.0     2040    0.255616
29.0     1963    0.245967
30.0     1821    0.228175
31.0     1634    0.204743
32.0     1616    0.202488
33.0     1551    0.194343
34.0     1437    0.180059
35.0     1320    0.165398
37.0     1317    0.165022
36.0     1281    0.160512
38.0     1255    0.157254
40.0     1115    0.139712
39.0     1113    0.139461
42.0     1043    0.130690
41.0     1031    0.129186
44.0      829    0.103875
43.0      819    0.102622
46.0      771    0.096608
45.0      686    0.085957
47.0      665    0.083326
48.0      662    0.082950
49.0      609    0.076309
50.0      548    0.068665
52.0      538    0.067412
51.0      474    0.059393
55.0      436    0.054632
54.0      436    0.054632
53.0      427    0.053504
56.0      399    0.049995
58.0      394    0.049369
57.0      376    0.047113
61.0      371    0.046487
59.0      346    0.043354
64.0      312    0.039094
62.0      308    0.038593
60.0      295    0.036964
63.0      291    0.036463
67.0      248    0.031075
70.0      248    0.031075
68.0      246    0.030824
73.0      245    0.030699
66.0      244    0.030574
72.0      226    0.028318
65.0      201    0.025186
69.0      192    0.024058
71.0      188    0.023557
77.0      182    0.022805
75.0      171    0.021427
74.0      171    0.021427
80.0      150    0.018795
76.0      149    0.018670
82.0      138    0.017292
79.0      136    0.017041
85.0      131    0.016415
84.0      131    0.016415
81.0      130    0.016289
78.0      128    0.016039
91.0      123    0.015412
83.0      123    0.015412
86.0      122    0.015287
87.0      119    0.014911
89.0      117    0.014660
93.0      116    0.014535
92.0      114    0.014284
88.0      107    0.013407
90.0      101    0.012655
97.0       99    0.012405
102.0      96    0.012029
95.0       91    0.011402
98.0       91    0.011402
103.0      90    0.011277
99.0       86    0.010776
...       ...         ...
348.0      11    0.001378
163.0      11    0.001378
265.0      11    0.001378
236.0      10    0.001253
184.0      10    0.001253
218.0      10    0.001253
190.0      10    0.001253
258.0      10    0.001253
194.0      10    0.001253
277.0       9    0.001128
172.0       9    0.001128
367.0       9    0.001128
314.0       9    0.001128
438.0       9    0.001128
189.0       9    0.001128
209.0       9    0.001128
151.0       9    0.001128
274.0       9    0.001128
221.0       9    0.001128
219.0       9    0.001128
344.0       9    0.001128
321.0       8    0.001002
595.0       8    0.001002
177.0       8    0.001002
333.0       8    0.001002
169.0       8    0.001002
200.0       8    0.001002
207.0       8    0.001002
185.0       8    0.001002
318.0       8    0.001002
252.0       8    0.001002
346.0       7    0.000877
242.0       7    0.000877
267.0       7    0.000877
353.0       7    0.000877
276.0       7    0.000877
328.0       7    0.000877
286.0       7    0.000877
211.0       7    0.000877
206.0       7    0.000877
241.0       7    0.000877
208.0       7    0.000877
201.0       7    0.000877
445.0       7    0.000877
191.0       7    0.000877
317.0       6    0.000752
304.0       6    0.000752
202.0       6    0.000752
215.0       6    0.000752
379.0       6    0.000752
259.0       6    0.000752
193.0       6    0.000752
430.0       6    0.000752
283.0       6    0.000752
214.0       6    0.000752
290.0       6    0.000752
166.0       6    0.000752
204.0       6    0.000752
243.0       6    0.000752
377.0       5    0.000627
205.0       5    0.000627
231.0       5    0.000627
197.0       5    0.000627
280.0       5    0.000627
263.0       5    0.000627
229.0       5    0.000627
240.0       5    0.000627
228.0       5    0.000627
316.0       5    0.000627
326.0       4    0.000501
285.0       4    0.000501
515.0       4    0.000501
523.0       4    0.000501
301.0       4    0.000501
249.0       4    0.000501
174.0       4    0.000501
266.0       4    0.000501
256.0       4    0.000501
255.0       4    0.000501
250.0       4    0.000501
260.0       4    0.000501
331.0       4    0.000501
226.0       3    0.000376
224.0       3    0.000376
168.0       3    0.000376
307.0       3    0.000376
414.0       3    0.000376
244.0       3    0.000376
378.0       3    0.000376
293.0       3    0.000376
272.0       3    0.000376
395.0       3    0.000376
237.0       2    0.000251
254.0       2    0.000251
404.0       2    0.000251
213.0       2    0.000251
366.0       1    0.000125
536.0       1    0.000125
232.0       1    0.000125
220.0       1    0.000125

[292 rows x 2 columns]
ANZ_HH_TITEL
      counts  percentage
0.0   770244   96.982044
1.0    20157    2.537984
2.0     2459    0.309615
3.0      585    0.073658
4.0      232    0.029211
5.0      117    0.014732
6.0      106    0.013347
8.0       68    0.008562
7.0       65    0.008184
9.0       34    0.004281
13.0      29    0.003651
12.0      22    0.002770
11.0      22    0.002770
14.0      16    0.002015
10.0      16    0.002015
17.0      13    0.001637
20.0       9    0.001133
15.0       7    0.000881
18.0       6    0.000755
16.0       3    0.000378
23.0       3    0.000378
GEBAEUDETYP
     counts  percentage
1.0  460465   57.697103
3.0  178668   22.387426
8.0  152476   19.105520
2.0    4935    0.618364
4.0     900    0.112772
6.0     628    0.078690
5.0       1    0.000125
KONSUMNAEHE
     counts  percentage
1.0  193738   23.706029
3.0  171127   20.939319
5.0  153535   18.786739
2.0  134665   16.477782
4.0  133324   16.313695
6.0   26625    3.257869
7.0    4238    0.518567
MIN_GEBAEUDEJAHR
        counts  percentage
1992.0  568776   71.268668
1994.0   78835    9.878169
1993.0   25488    3.193693
1995.0   25464    3.190686
1996.0   16611    2.081389
1997.0   14464    1.812366
2000.0    7382    0.924978
2001.0    5877    0.736399
1991.0    5811    0.728129
2005.0    5553    0.695801
1999.0    4413    0.552957
1990.0    4408    0.552330
2002.0    4216    0.528272
1998.0    4097    0.513362
2003.0    3356    0.420513
2004.0    2935    0.367761
2008.0    2197    0.275288
2007.0    2156    0.270151
1989.0    2046    0.256368
2009.0    2016    0.252608
2006.0    1984    0.248599
2011.0    1903    0.238449
2012.0    1861    0.233187
2010.0    1410    0.176676
2013.0    1230    0.154121
1988.0    1027    0.128685
2014.0    1001    0.125427
2015.0     717    0.089841
1987.0     470    0.058892
2016.0     128    0.016039
1986.0     125    0.015663
1985.0     116    0.014535
OST_WEST_KZ
   counts  percentage
W  629528   78.881005
O  168545   21.118995
WOHNLAGE
     counts  percentage
3.0  249719   31.290245
7.0  169318   21.215854
4.0  135973   17.037664
2.0  100376   12.577296
5.0   74346    9.315689
1.0   43918    5.503005
8.0   17473    2.189399
0.0    6950    0.870848
CAMEO_DEUG_2015
   counts  percentage
8  134441   16.969689
9  108177   13.654540
6  105874   13.363846
4  103912   13.116194
3   86779   10.953598
2   83231   10.505755
7   77933    9.837019
5   55310    6.981453
1   36212    4.570826
X     373    0.047082
CAMEO_DEU_2015
    counts  percentage
6B   56672    7.153370
8A   52438    6.618937
4C   47819    6.035908
2D   35074    4.427183
3C   34769    4.388684
7A   34399    4.341981
3D   34307    4.330369
8B   33434    4.220175
4A   33155    4.184959
8C   30993    3.912062
9D   28593    3.609124
9B   27676    3.493377
9C   24987    3.153961
7B   24503    3.092868
9A   20542    2.592895
2C   19422    2.451524
8D   17576    2.218514
6E   16107    2.033091
2B   15486    1.954706
5D   14943    1.886166
6C   14820    1.870641
2A   13249    1.672343
5A   12214    1.541701
1D   11909    1.503202
1A   10850    1.369531
3A   10543    1.330780
5B   10354    1.306924
5C    9935    1.254036
7C    9065    1.144221
4B    9047    1.141949
4D    8570    1.081740
3B    7160    0.903764
6A    6810    0.859586
9E    6379    0.805183
6D    6073    0.766559
6F    5392    0.680600
7D    5333    0.673153
4E    5321    0.671638
1E    5065    0.639325
7E    4633    0.584796
1C    4317    0.544909
5F    4283    0.540618
1B    4071    0.513858
5E    3581    0.452008
XX     373    0.047082
CAMEO_INTL_2015
    counts  percentage
51  133694   16.875399
41   92336   11.655025
24   91158   11.506333
14   62884    7.937474
43   56672    7.153370
54   45391    5.729436
25   39628    5.002007
22   33155    4.184959
23   26750    3.376494
13   26336    3.324237
45   26132    3.298487
55   23955    3.023697
52   20542    2.592895
31   19024    2.401286
34   18524    2.338174
15   16974    2.142527
44   14820    1.870641
12   13249    1.672343
35   10356    1.307176
32   10354    1.306924
33    9935    1.254036
XX     373    0.047082
KBA05_ANTG1
     counts  percentage
0.0  261049   34.443862
1.0  161224   21.272548
2.0  126725   16.720610
3.0  117762   15.537995
4.0   91137   12.024985
KBA05_ANTG2
     counts  percentage
0.0  292538   38.598649
1.0  163751   21.605970
2.0  138273   18.244300
3.0  134455   17.740537
4.0   28880    3.810544
KBA05_ANTG3
     counts  percentage
0.0  511545   67.495319
1.0   92748   12.237547
2.0   80234   10.586399
3.0   73370    9.680735
KBA05_ANTG4
     counts  percentage
0.0  600171   79.188993
1.0   83591   11.029335
2.0   74135    9.781672
KBA05_BAUMAX
     counts  percentage
0.0  343200   45.283198
1.0  208417   27.499383
5.0   98923   13.052301
3.0   59955    7.910706
4.0   37718    4.976666
2.0    9684    1.277746
KBA05_GBZ
     counts  percentage
3.0  197833   26.102887
5.0  158971   20.975278
4.0  155301   20.491043
2.0  138528   18.277945
1.0  107264   14.152847
BALLRAUM
     counts  percentage
6.0  255093   31.987345
1.0  151782   19.032679
2.0  104521   13.106394
7.0   99039   12.418979
3.0   73277    9.188557
4.0   61358    7.693976
5.0   52411    6.572069
EWDICHTE
     counts  percentage
6.0  201009   25.205491
5.0  161209   20.214776
2.0  139087   17.440792
4.0  130716   16.391112
1.0   84051   10.539561
3.0   81409   10.208268
INNENSTADT
     counts  percentage
5.0  147626   18.511538
4.0  134067   16.811310
6.0  111679   14.003970
2.0  109048   13.674056
3.0   92818   11.638898
8.0   82870   10.391470
7.0   67463    8.459512
1.0   51910    6.509246
GEBAEUDETYP_RASTER
     counts  percentage
4.0  359620   45.061436
3.0  205330   25.728449
5.0  159217   19.950355
2.0   58961    7.387985
1.0   14938    1.871775
KKK
     counts  percentage
3.0  273024   35.456511
2.0  181519   23.573131
4.0  178648   23.200286
1.0   99966   12.982176
0.0   36868    4.787896
MOBI_REGIO
     counts  percentage
1.0  163993   21.637901
3.0  150336   19.835941
5.0  148713   19.621796
4.0  148209   19.555296
2.0  146305   19.304074
6.0     341    0.044993
ONLINE_AFFINITAET
     counts  percentage
2.0  197850   22.321454
4.0  164704   18.581919
3.0  163487   18.444617
1.0  156499   17.656230
5.0  138111   15.581695
0.0   65716    7.414085
REGIOTYP
     counts  percentage
6.0  195286   25.360995
5.0  145359   18.877179
3.0   93929   12.198175
2.0   91662   11.903769
7.0   83943   10.901334
4.0   68180    8.854258
1.0   54798    7.116392
0.0   36868    4.787896
KBA13_ANZAHL_PKW
        counts  percentage
1400.0   11722    1.492448
1500.0    8291    1.055612
1300.0    6427    0.818287
1600.0    6135    0.781110
1700.0    3795    0.483180
1800.0    2617    0.333197
464.0     1604    0.204222
417.0     1604    0.204222
519.0     1600    0.203712
534.0     1496    0.190471
386.0     1458    0.185633
1900.0    1450    0.184614
395.0     1446    0.184105
481.0     1417    0.180413
455.0     1409    0.179394
483.0     1393    0.177357
452.0     1388    0.176721
418.0     1384    0.176211
454.0     1380    0.175702
450.0     1380    0.175702
494.0     1379    0.175575
459.0     1379    0.175575
492.0     1359    0.173028
504.0     1340    0.170609
387.0     1338    0.170354
420.0     1337    0.170227
439.0     1327    0.168954
506.0     1326    0.168827
388.0     1324    0.168572
456.0     1323    0.168445
487.0     1319    0.167935
402.0     1318    0.167808
421.0     1317    0.167681
499.0     1310    0.166790
466.0     1308    0.166535
491.0     1302    0.165771
490.0     1302    0.165771
558.0     1301    0.165644
477.0     1298    0.165262
489.0     1296    0.165007
567.0     1292    0.164498
536.0     1290    0.164243
406.0     1288    0.163988
516.0     1284    0.163479
393.0     1282    0.163225
453.0     1282    0.163225
389.0     1281    0.163097
390.0     1280    0.162970
556.0     1278    0.162715
584.0     1278    0.162715
438.0     1276    0.162461
574.0     1274    0.162206
485.0     1273    0.162079
537.0     1267    0.161315
517.0     1266    0.161187
479.0     1264    0.160933
508.0     1262    0.160678
497.0     1262    0.160678
377.0     1257    0.160042
478.0     1256    0.159914
500.0     1255    0.159787
352.0     1254    0.159660
572.0     1254    0.159660
375.0     1254    0.159660
467.0     1254    0.159660
446.0     1253    0.159532
409.0     1252    0.159405
451.0     1245    0.158514
429.0     1245    0.158514
384.0     1243    0.158259
515.0     1242    0.158132
426.0     1241    0.158004
434.0     1240    0.157877
518.0     1239    0.157750
470.0     1232    0.156859
554.0     1232    0.156859
488.0     1232    0.156859
471.0     1229    0.156477
369.0     1228    0.156349
382.0     1225    0.155967
360.0     1224    0.155840
442.0     1223    0.155713
410.0     1221    0.155458
502.0     1220    0.155331
430.0     1220    0.155331
399.0     1218    0.155076
396.0     1217    0.154949
345.0     1215    0.154694
509.0     1214    0.154567
428.0     1213    0.154439
565.0     1212    0.154312
380.0     1207    0.153676
475.0     1205    0.153421
597.0     1203    0.153166
549.0     1201    0.152912
412.0     1201    0.152912
469.0     1200    0.152784
530.0     1200    0.152784
457.0     1200    0.152784
437.0     1198    0.152530
...        ...         ...
1168.0     113    0.014387
1177.0     112    0.014260
76.0       111    0.014133
1198.0     107    0.013623
88.0       107    0.013623
1074.0     103    0.013114
1221.0     103    0.013114
69.0       101    0.012859
71.0        99    0.012605
1133.0      98    0.012477
100.0       97    0.012350
64.0        95    0.012095
1224.0      95    0.012095
80.0        94    0.011968
70.0        91    0.011586
73.0        89    0.011332
74.0        89    0.011332
1233.0      87    0.011077
1122.0      85    0.010822
1213.0      85    0.010822
75.0        85    0.010822
51.0        84    0.010695
78.0        84    0.010695
68.0        84    0.010695
63.0        80    0.010186
1093.0      78    0.009931
47.0        77    0.009804
59.0        76    0.009676
1247.0      76    0.009676
52.0        75    0.009549
1184.0      75    0.009549
77.0        74    0.009422
72.0        74    0.009422
1115.0      73    0.009294
1185.0      71    0.009040
66.0        69    0.008785
1098.0      67    0.008530
65.0        67    0.008530
62.0        66    0.008403
67.0        66    0.008403
0.0         62    0.007894
1225.0      61    0.007767
58.0        60    0.007639
45.0        59    0.007512
61.0        58    0.007385
1232.0      57    0.007257
60.0        54    0.006875
56.0        53    0.006748
55.0        53    0.006748
53.0        53    0.006748
48.0        53    0.006748
44.0        52    0.006621
54.0        50    0.006366
46.0        46    0.005857
57.0        46    0.005857
37.0        45    0.005729
41.0        44    0.005602
35.0        43    0.005475
34.0        42    0.005347
40.0        42    0.005347
36.0        38    0.004838
50.0        37    0.004711
42.0        35    0.004456
38.0        35    0.004456
33.0        31    0.003947
31.0        29    0.003692
39.0        28    0.003565
32.0        28    0.003565
49.0        27    0.003438
43.0        27    0.003438
28.0        24    0.003056
27.0        24    0.003056
25.0        23    0.002928
24.0        22    0.002801
26.0        21    0.002674
18.0        21    0.002674
17.0        20    0.002546
20.0        18    0.002292
21.0        17    0.002164
22.0        16    0.002037
12.0        16    0.002037
14.0        16    0.002037
29.0        15    0.001910
15.0        14    0.001782
23.0        13    0.001655
30.0        12    0.001528
16.0        11    0.001401
19.0        11    0.001401
13.0        10    0.001273
1.0          8    0.001019
10.0         8    0.001019
11.0         7    0.000891
5.0          7    0.000891
9.0          7    0.000891
4.0          7    0.000891
3.0          6    0.000764
8.0          6    0.000764
2.0          6    0.000764
7.0          5    0.000637
6.0          5    0.000637

[1261 rows x 2 columns]
PLZ8_ANTG1
     counts  percentage
2.0  270590   34.928089
3.0  222355   28.701856
1.0  189247   24.428235
4.0   87044   11.235746
0.0    5470    0.706074
PLZ8_ANTG2
     counts  percentage
3.0  307283   39.664466
2.0  215767   27.851469
4.0  191005   24.655160
1.0   53213    6.868799
0.0    7438    0.960106
PLZ8_ANTG3
     counts  percentage
2.0  252994   32.656776
1.0  237878   30.705584
3.0  164040   21.174484
0.0  119794   15.463156
PLZ8_ANTG4
     counts  percentage
0.0  356389   46.003129
1.0  294986   38.077154
2.0  123331   15.919717
PLZ8_BAUMAX
     counts  percentage
1.0  499550   64.482526
5.0   97333   12.563863
2.0   70407    9.088222
4.0   56684    7.316840
3.0   50732    6.548549
PLZ8_HHZ
     counts  percentage
3.0  309146   39.904945
4.0  211911   27.353732
5.0  175813   22.694158
2.0   66891    8.634372
1.0   10945    1.412794
PLZ8_GBZ
     counts  percentage
3.0  288383   37.224831
4.0  180252   23.267149
5.0  153883   19.863406
2.0  111588   14.403916
1.0   40600    5.240698
ARBEIT
     counts  percentage
4.0  311339   39.211214
3.0  254988   32.114155
2.0  135662   17.085787
1.0   56767    7.149451
5.0   35090    4.419368
9.0     159    0.020025
ORTSGR_KLS9
     counts  percentage
5.0  148096   18.651772
4.0  114909   14.472075
7.0  102866   12.955334
9.0   91879   11.571590
3.0   83542   10.521596
6.0   75995    9.571098
8.0   72709    9.157247
2.0   63362    7.980051
1.0   40589    5.111933
0.0      58    0.007305
RELAT_AB
     counts  percentage
3.0  274008   34.509606
5.0  174964   22.035629
1.0  142907   17.998249
2.0  104846   13.204703
4.0   97121   12.231787
9.0     159    0.020025

Explore AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns

In [16]:
explore_data(feat_info)   
Shape: rows, cols
(85, 4)


Dataframe Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 4 columns):
attribute             85 non-null object
information_level     85 non-null object
type                  85 non-null object
missing_or_unknown    85 non-null object
dtypes: object(4)
memory usage: 2.7+ KB
None

Desciptive stats:
              attribute information_level     type missing_or_unknown
count                85                85       85                 85
unique               85                 9        5                  9
top     CAMEO_DEUG_2015            person  ordinal               [-1]
freq                  1                43       49                 26
In [17]:
feat_info
Out[17]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
5 FINANZ_SPARER person ordinal [-1]
6 FINANZ_VORSORGER person ordinal [-1]
7 FINANZ_ANLEGER person ordinal [-1]
8 FINANZ_UNAUFFAELLIGER person ordinal [-1]
9 FINANZ_HAUSBAUER person ordinal [-1]
10 FINANZTYP person categorical [-1]
11 GEBURTSJAHR person numeric [0]
12 GFK_URLAUBERTYP person categorical []
13 GREEN_AVANTGARDE person categorical []
14 HEALTH_TYP person ordinal [-1,0]
15 LP_LEBENSPHASE_FEIN person mixed [0]
16 LP_LEBENSPHASE_GROB person mixed [0]
17 LP_FAMILIE_FEIN person categorical [0]
18 LP_FAMILIE_GROB person categorical [0]
19 LP_STATUS_FEIN person categorical [0]
20 LP_STATUS_GROB person categorical [0]
21 NATIONALITAET_KZ person categorical [-1,0]
22 PRAEGENDE_JUGENDJAHRE person mixed [-1,0]
23 RETOURTYP_BK_S person ordinal [0]
24 SEMIO_SOZ person ordinal [-1,9]
25 SEMIO_FAM person ordinal [-1,9]
26 SEMIO_REL person ordinal [-1,9]
27 SEMIO_MAT person ordinal [-1,9]
28 SEMIO_VERT person ordinal [-1,9]
29 SEMIO_LUST person ordinal [-1,9]
30 SEMIO_ERL person ordinal [-1,9]
31 SEMIO_KULT person ordinal [-1,9]
32 SEMIO_RAT person ordinal [-1,9]
33 SEMIO_KRIT person ordinal [-1,9]
34 SEMIO_DOM person ordinal [-1,9]
35 SEMIO_KAEM person ordinal [-1,9]
36 SEMIO_PFLICHT person ordinal [-1,9]
37 SEMIO_TRADV person ordinal [-1,9]
38 SHOPPER_TYP person categorical [-1]
39 SOHO_KZ person categorical [-1]
40 TITEL_KZ person categorical [-1,0]
41 VERS_TYP person categorical [-1]
42 ZABEOTYP person categorical [-1,9]
43 ALTER_HH household interval [0]
44 ANZ_PERSONEN household numeric []
45 ANZ_TITEL household numeric []
46 HH_EINKOMMEN_SCORE household ordinal [-1,0]
47 KK_KUNDENTYP household categorical [-1]
48 W_KEIT_KIND_HH household ordinal [-1,0]
49 WOHNDAUER_2008 household ordinal [-1,0]
50 ANZ_HAUSHALTE_AKTIV building numeric [0]
51 ANZ_HH_TITEL building numeric []
52 GEBAEUDETYP building categorical [-1,0]
53 KONSUMNAEHE building ordinal []
54 MIN_GEBAEUDEJAHR building numeric [0]
55 OST_WEST_KZ building categorical [-1]
56 WOHNLAGE building mixed [-1]
57 CAMEO_DEUG_2015 microcell_rr4 categorical [-1,X]
58 CAMEO_DEU_2015 microcell_rr4 categorical [XX]
59 CAMEO_INTL_2015 microcell_rr4 mixed [-1,XX]
60 KBA05_ANTG1 microcell_rr3 ordinal [-1]
61 KBA05_ANTG2 microcell_rr3 ordinal [-1]
62 KBA05_ANTG3 microcell_rr3 ordinal [-1]
63 KBA05_ANTG4 microcell_rr3 ordinal [-1]
64 KBA05_BAUMAX microcell_rr3 mixed [-1,0]
65 KBA05_GBZ microcell_rr3 ordinal [-1,0]
66 BALLRAUM postcode ordinal [-1]
67 EWDICHTE postcode ordinal [-1]
68 INNENSTADT postcode ordinal [-1]
69 GEBAEUDETYP_RASTER region_rr1 ordinal []
70 KKK region_rr1 ordinal [-1,0]
71 MOBI_REGIO region_rr1 ordinal []
72 ONLINE_AFFINITAET region_rr1 ordinal []
73 REGIOTYP region_rr1 ordinal [-1,0]
74 KBA13_ANZAHL_PKW macrocell_plz8 numeric []
75 PLZ8_ANTG1 macrocell_plz8 ordinal [-1]
76 PLZ8_ANTG2 macrocell_plz8 ordinal [-1]
77 PLZ8_ANTG3 macrocell_plz8 ordinal [-1]
78 PLZ8_ANTG4 macrocell_plz8 ordinal [-1]
79 PLZ8_BAUMAX macrocell_plz8 mixed [-1,0]
80 PLZ8_HHZ macrocell_plz8 ordinal [-1]
81 PLZ8_GBZ macrocell_plz8 ordinal [-1]
82 ARBEIT community ordinal [-1,9]
83 ORTSGR_KLS9 community ordinal [-1,0]
84 RELAT_AB community ordinal [-1,9]
In [18]:
feat_info.head(10)
Out[18]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
5 FINANZ_SPARER person ordinal [-1]
6 FINANZ_VORSORGER person ordinal [-1]
7 FINANZ_ANLEGER person ordinal [-1]
8 FINANZ_UNAUFFAELLIGER person ordinal [-1]
9 FINANZ_HAUSBAUER person ordinal [-1]
In [19]:
feat_info
Out[19]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
5 FINANZ_SPARER person ordinal [-1]
6 FINANZ_VORSORGER person ordinal [-1]
7 FINANZ_ANLEGER person ordinal [-1]
8 FINANZ_UNAUFFAELLIGER person ordinal [-1]
9 FINANZ_HAUSBAUER person ordinal [-1]
10 FINANZTYP person categorical [-1]
11 GEBURTSJAHR person numeric [0]
12 GFK_URLAUBERTYP person categorical []
13 GREEN_AVANTGARDE person categorical []
14 HEALTH_TYP person ordinal [-1,0]
15 LP_LEBENSPHASE_FEIN person mixed [0]
16 LP_LEBENSPHASE_GROB person mixed [0]
17 LP_FAMILIE_FEIN person categorical [0]
18 LP_FAMILIE_GROB person categorical [0]
19 LP_STATUS_FEIN person categorical [0]
20 LP_STATUS_GROB person categorical [0]
21 NATIONALITAET_KZ person categorical [-1,0]
22 PRAEGENDE_JUGENDJAHRE person mixed [-1,0]
23 RETOURTYP_BK_S person ordinal [0]
24 SEMIO_SOZ person ordinal [-1,9]
25 SEMIO_FAM person ordinal [-1,9]
26 SEMIO_REL person ordinal [-1,9]
27 SEMIO_MAT person ordinal [-1,9]
28 SEMIO_VERT person ordinal [-1,9]
29 SEMIO_LUST person ordinal [-1,9]
30 SEMIO_ERL person ordinal [-1,9]
31 SEMIO_KULT person ordinal [-1,9]
32 SEMIO_RAT person ordinal [-1,9]
33 SEMIO_KRIT person ordinal [-1,9]
34 SEMIO_DOM person ordinal [-1,9]
35 SEMIO_KAEM person ordinal [-1,9]
36 SEMIO_PFLICHT person ordinal [-1,9]
37 SEMIO_TRADV person ordinal [-1,9]
38 SHOPPER_TYP person categorical [-1]
39 SOHO_KZ person categorical [-1]
40 TITEL_KZ person categorical [-1,0]
41 VERS_TYP person categorical [-1]
42 ZABEOTYP person categorical [-1,9]
43 ALTER_HH household interval [0]
44 ANZ_PERSONEN household numeric []
45 ANZ_TITEL household numeric []
46 HH_EINKOMMEN_SCORE household ordinal [-1,0]
47 KK_KUNDENTYP household categorical [-1]
48 W_KEIT_KIND_HH household ordinal [-1,0]
49 WOHNDAUER_2008 household ordinal [-1,0]
50 ANZ_HAUSHALTE_AKTIV building numeric [0]
51 ANZ_HH_TITEL building numeric []
52 GEBAEUDETYP building categorical [-1,0]
53 KONSUMNAEHE building ordinal []
54 MIN_GEBAEUDEJAHR building numeric [0]
55 OST_WEST_KZ building categorical [-1]
56 WOHNLAGE building mixed [-1]
57 CAMEO_DEUG_2015 microcell_rr4 categorical [-1,X]
58 CAMEO_DEU_2015 microcell_rr4 categorical [XX]
59 CAMEO_INTL_2015 microcell_rr4 mixed [-1,XX]
60 KBA05_ANTG1 microcell_rr3 ordinal [-1]
61 KBA05_ANTG2 microcell_rr3 ordinal [-1]
62 KBA05_ANTG3 microcell_rr3 ordinal [-1]
63 KBA05_ANTG4 microcell_rr3 ordinal [-1]
64 KBA05_BAUMAX microcell_rr3 mixed [-1,0]
65 KBA05_GBZ microcell_rr3 ordinal [-1,0]
66 BALLRAUM postcode ordinal [-1]
67 EWDICHTE postcode ordinal [-1]
68 INNENSTADT postcode ordinal [-1]
69 GEBAEUDETYP_RASTER region_rr1 ordinal []
70 KKK region_rr1 ordinal [-1,0]
71 MOBI_REGIO region_rr1 ordinal []
72 ONLINE_AFFINITAET region_rr1 ordinal []
73 REGIOTYP region_rr1 ordinal [-1,0]
74 KBA13_ANZAHL_PKW macrocell_plz8 numeric []
75 PLZ8_ANTG1 macrocell_plz8 ordinal [-1]
76 PLZ8_ANTG2 macrocell_plz8 ordinal [-1]
77 PLZ8_ANTG3 macrocell_plz8 ordinal [-1]
78 PLZ8_ANTG4 macrocell_plz8 ordinal [-1]
79 PLZ8_BAUMAX macrocell_plz8 mixed [-1,0]
80 PLZ8_HHZ macrocell_plz8 ordinal [-1]
81 PLZ8_GBZ macrocell_plz8 ordinal [-1]
82 ARBEIT community ordinal [-1,9]
83 ORTSGR_KLS9 community ordinal [-1,0]
84 RELAT_AB community ordinal [-1,9]
In [20]:
# bar_graph_for_each_column?
In [21]:
bar_graph_for_each_column(feat_info, column_list=['information_level','type'])
information_level
                counts  percentage
person              43   50.588235
macrocell_plz8       8    9.411765
building             7    8.235294
household            7    8.235294
microcell_rr3        6    7.058824
region_rr1           5    5.882353
community            3    3.529412
microcell_rr4        3    3.529412
postcode             3    3.529412
type
             counts  percentage
ordinal          49   57.647059
categorical      21   24.705882
mixed             7    8.235294
numeric           7    8.235294
interval          1    1.176471

Most data is about a person and is ordinal. Second group, after ordinal, is categorical.

Tip: Add additional cells to keep everything in reasonably-sized chunks! Keyboard shortcut esc --> a (press escape to enter command mode, then press the 'A' key) adds a new cell before the active cell, and esc --> b adds a new cell after the active cell. If you need to convert an active cell to a markdown cell, use esc --> m and to convert to a code cell, use esc --> y.

Step 1: Preprocessing

Step 1.1: Assess Missing Data

The feature summary file contains a summary of properties for each demographics data column. You will use this file to help you make cleaning decisions during this stage of the project. First of all, you should assess the demographics data in terms of missing data. Pay attention to the following points as you perform your analysis, and take notes on what you observe. Make sure that you fill in the Discussion cell with your findings and decisions at the end of each step that has one!

Step 1.1.1: Convert Missing Value Codes to NaNs

The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object. You'll need to do a little bit of parsing to make use of it to identify and clean the data. Convert data that matches a 'missing' or 'unknown' value code into a numpy NaN value. You might want to see how much data takes on a 'missing' or 'unknown' code, and how much data is naturally missing, as a point of interest.

As one more reminder, you are encouraged to add additional cells to break up your analysis into manageable chunks.

Identify missing or unknown data values and convert them to NaNs.

In [23]:
# how many missing (NAN) values there are in each column
# this is BEFORE we convert categories which are not misssing values, but denote missing or unknown information
# This is how much data is NATURALLY missing (Python NAN, SQL null, Excel blank)

azdias.isnull().sum()    
# this will give you the same result:
# azdias.isna().sum()       

# If you add it up to azdias.info() non-null, you will get 891221 in each row.
Out[23]:
AGER_TYP                      0
ALTERSKATEGORIE_GROB          0
ANREDE_KZ                     0
CJT_GESAMTTYP              4854
FINANZ_MINIMALIST             0
FINANZ_SPARER                 0
FINANZ_VORSORGER              0
FINANZ_ANLEGER                0
FINANZ_UNAUFFAELLIGER         0
FINANZ_HAUSBAUER              0
FINANZTYP                     0
GEBURTSJAHR                   0
GFK_URLAUBERTYP            4854
GREEN_AVANTGARDE              0
HEALTH_TYP                    0
LP_LEBENSPHASE_FEIN        4854
LP_LEBENSPHASE_GROB        4854
LP_FAMILIE_FEIN            4854
LP_FAMILIE_GROB            4854
LP_STATUS_FEIN             4854
LP_STATUS_GROB             4854
NATIONALITAET_KZ              0
PRAEGENDE_JUGENDJAHRE         0
RETOURTYP_BK_S             4854
SEMIO_SOZ                     0
SEMIO_FAM                     0
SEMIO_REL                     0
SEMIO_MAT                     0
SEMIO_VERT                    0
SEMIO_LUST                    0
SEMIO_ERL                     0
SEMIO_KULT                    0
SEMIO_RAT                     0
SEMIO_KRIT                    0
SEMIO_DOM                     0
SEMIO_KAEM                    0
SEMIO_PFLICHT                 0
SEMIO_TRADV                   0
SHOPPER_TYP                   0
SOHO_KZ                   73499
TITEL_KZ                  73499
VERS_TYP                      0
ZABEOTYP                      0
ALTER_HH                  73499
ANZ_PERSONEN              73499
ANZ_TITEL                 73499
HH_EINKOMMEN_SCORE        18348
KK_KUNDENTYP             584612
W_KEIT_KIND_HH           107602
WOHNDAUER_2008            73499
ANZ_HAUSHALTE_AKTIV       93148
ANZ_HH_TITEL              97008
GEBAEUDETYP               93148
KONSUMNAEHE               73969
MIN_GEBAEUDEJAHR          93148
OST_WEST_KZ               93148
WOHNLAGE                  93148
CAMEO_DEUG_2015           98979
CAMEO_DEU_2015            98979
CAMEO_INTL_2015           98979
KBA05_ANTG1              133324
KBA05_ANTG2              133324
KBA05_ANTG3              133324
KBA05_ANTG4              133324
KBA05_BAUMAX             133324
KBA05_GBZ                133324
BALLRAUM                  93740
EWDICHTE                  93740
INNENSTADT                93740
GEBAEUDETYP_RASTER        93155
KKK                      121196
MOBI_REGIO               133324
ONLINE_AFFINITAET          4854
REGIOTYP                 121196
KBA13_ANZAHL_PKW         105800
PLZ8_ANTG1               116515
PLZ8_ANTG2               116515
PLZ8_ANTG3               116515
PLZ8_ANTG4               116515
PLZ8_BAUMAX              116515
PLZ8_HHZ                 116515
PLZ8_GBZ                 116515
ARBEIT                    97216
ORTSGR_KLS9               97216
RELAT_AB                  97216
dtype: int64
In [24]:
# how many missing (NAN) values there are in each column
# this is BEFORE we convert categories which are not misssing values, but denote missing or unknown information
# This is how much data is NATURALLY missing (Python NAN, SQL null, Excel blank)
def how_many_NA(df):
    """Loop thru columns. Give number of missing NA values.
    
    his function tells you how many missing values there are in each column.

    if you add it up to azdias.info() non-null, you will get 891221 in each row.
    
    This is an extended version of the function above. It also calculates % of missing columns. 
    Returns a dataframe/report on NAN
    """
    missing_NA_list = []
    missing_NA_percent_list = []
    
    for column in df.columns.tolist():
        missing_NA = df[column].isna().sum()
        
        missing_NA_percent = missing_NA / len(df)
        
        missing_NA_list.append(missing_NA)
        missing_NA_percent_list.append(missing_NA_percent)
        
    missing_value_report_df = pd.DataFrame(
            {'Column': df.columns.tolist(),
             'missing_NA': missing_NA_list,
             'missing_NA_percent': missing_NA_percent_list
             }
            )
    return missing_value_report_df
In [25]:
azdias_NA_report = how_many_NA(azdias)
azdias_NA_report
Out[25]:
Column missing_NA missing_NA_percent
0 AGER_TYP 0 0.000000
1 ALTERSKATEGORIE_GROB 0 0.000000
2 ANREDE_KZ 0 0.000000
3 CJT_GESAMTTYP 4854 0.005446
4 FINANZ_MINIMALIST 0 0.000000
5 FINANZ_SPARER 0 0.000000
6 FINANZ_VORSORGER 0 0.000000
7 FINANZ_ANLEGER 0 0.000000
8 FINANZ_UNAUFFAELLIGER 0 0.000000
9 FINANZ_HAUSBAUER 0 0.000000
10 FINANZTYP 0 0.000000
11 GEBURTSJAHR 0 0.000000
12 GFK_URLAUBERTYP 4854 0.005446
13 GREEN_AVANTGARDE 0 0.000000
14 HEALTH_TYP 0 0.000000
15 LP_LEBENSPHASE_FEIN 4854 0.005446
16 LP_LEBENSPHASE_GROB 4854 0.005446
17 LP_FAMILIE_FEIN 4854 0.005446
18 LP_FAMILIE_GROB 4854 0.005446
19 LP_STATUS_FEIN 4854 0.005446
20 LP_STATUS_GROB 4854 0.005446
21 NATIONALITAET_KZ 0 0.000000
22 PRAEGENDE_JUGENDJAHRE 0 0.000000
23 RETOURTYP_BK_S 4854 0.005446
24 SEMIO_SOZ 0 0.000000
25 SEMIO_FAM 0 0.000000
26 SEMIO_REL 0 0.000000
27 SEMIO_MAT 0 0.000000
28 SEMIO_VERT 0 0.000000
29 SEMIO_LUST 0 0.000000
30 SEMIO_ERL 0 0.000000
31 SEMIO_KULT 0 0.000000
32 SEMIO_RAT 0 0.000000
33 SEMIO_KRIT 0 0.000000
34 SEMIO_DOM 0 0.000000
35 SEMIO_KAEM 0 0.000000
36 SEMIO_PFLICHT 0 0.000000
37 SEMIO_TRADV 0 0.000000
38 SHOPPER_TYP 0 0.000000
39 SOHO_KZ 73499 0.082470
40 TITEL_KZ 73499 0.082470
41 VERS_TYP 0 0.000000
42 ZABEOTYP 0 0.000000
43 ALTER_HH 73499 0.082470
44 ANZ_PERSONEN 73499 0.082470
45 ANZ_TITEL 73499 0.082470
46 HH_EINKOMMEN_SCORE 18348 0.020587
47 KK_KUNDENTYP 584612 0.655967
48 W_KEIT_KIND_HH 107602 0.120735
49 WOHNDAUER_2008 73499 0.082470
50 ANZ_HAUSHALTE_AKTIV 93148 0.104517
51 ANZ_HH_TITEL 97008 0.108848
52 GEBAEUDETYP 93148 0.104517
53 KONSUMNAEHE 73969 0.082997
54 MIN_GEBAEUDEJAHR 93148 0.104517
55 OST_WEST_KZ 93148 0.104517
56 WOHNLAGE 93148 0.104517
57 CAMEO_DEUG_2015 98979 0.111060
58 CAMEO_DEU_2015 98979 0.111060
59 CAMEO_INTL_2015 98979 0.111060
60 KBA05_ANTG1 133324 0.149597
61 KBA05_ANTG2 133324 0.149597
62 KBA05_ANTG3 133324 0.149597
63 KBA05_ANTG4 133324 0.149597
64 KBA05_BAUMAX 133324 0.149597
65 KBA05_GBZ 133324 0.149597
66 BALLRAUM 93740 0.105182
67 EWDICHTE 93740 0.105182
68 INNENSTADT 93740 0.105182
69 GEBAEUDETYP_RASTER 93155 0.104525
70 KKK 121196 0.135989
71 MOBI_REGIO 133324 0.149597
72 ONLINE_AFFINITAET 4854 0.005446
73 REGIOTYP 121196 0.135989
74 KBA13_ANZAHL_PKW 105800 0.118714
75 PLZ8_ANTG1 116515 0.130736
76 PLZ8_ANTG2 116515 0.130736
77 PLZ8_ANTG3 116515 0.130736
78 PLZ8_ANTG4 116515 0.130736
79 PLZ8_BAUMAX 116515 0.130736
80 PLZ8_HHZ 116515 0.130736
81 PLZ8_GBZ 116515 0.130736
82 ARBEIT 97216 0.109082
83 ORTSGR_KLS9 97216 0.109082
84 RELAT_AB 97216 0.109082
In [26]:
# how_many_NA?
In [27]:
# save to Excel
# azdias_NA_report.to_excel("azdias_NA_report.xlsx")
In [28]:
"""https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
"""
fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_NA_report['Column'], azdias_NA_report['missing_NA'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('NATURALLY missing data (NAN) in the ORIGINAL dataset', fontsize=18)

fmt = '{x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.xaxis.set_major_formatter(tick) 
ax.invert_yaxis()

plt.show()
In [29]:
"""https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
"""
fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_NA_report['Column'], azdias_NA_report['missing_NA_percent'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN_percent', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('NATURALLY missing data (NAN) in the ORIGINAL dataset - PERCENT', fontsize=18)

ax.invert_yaxis()

plt.show()
In [30]:
""" 
this is like VLOOKUP in Excel. Look up value from [missing_or_unknown] column of 
feat_info by value in [attribute] column of feat_info
https://stackoverflow.com/questions/18327624/find-elements-index-in-pandas-series

these values match feat_info.missing_or_unknown values

"""
for column in azdias.columns.tolist():
    print(feat_info.missing_or_unknown[feat_info[feat_info.attribute == column].index[0]])
[-1,0]
[-1,0,9]
[-1,0]
[0]
[-1]
[-1]
[-1]
[-1]
[-1]
[-1]
[-1]
[0]
[]
[]
[-1,0]
[0]
[0]
[0]
[0]
[0]
[0]
[-1,0]
[-1,0]
[0]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1,9]
[-1]
[-1]
[-1,0]
[-1]
[-1,9]
[0]
[]
[]
[-1,0]
[-1]
[-1,0]
[-1,0]
[0]
[]
[-1,0]
[]
[0]
[-1]
[-1]
[-1,X]
[XX]
[-1,XX]
[-1]
[-1]
[-1]
[-1]
[-1,0]
[-1,0]
[-1]
[-1]
[-1]
[]
[-1,0]
[]
[]
[-1,0]
[]
[-1]
[-1]
[-1]
[-1]
[-1,0]
[-1]
[-1]
[-1,9]
[-1,0]
[-1,9]

Step 1.1.2: Assess Missing Data in Each Column

How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing. You will want to use matplotlib's hist() function to visualize the distribution of missing value counts to find these columns. Identify and document these columns. While some of these columns might have justifications for keeping or re-encoding the data, for this project you should just remove them from the dataframe. (Feel free to make remarks about these outlier columns in the discussion, however!)

For the remaining features, are there any patterns in which columns have, or share, missing data?

Identify missing or unknown data values and convert them to NaNs.
In [31]:
def how_many_coded_uknown_or_missing(df):
    
    """"loop thru columns of df.
        Look at which specific values are codes for missing and uknown.
        How many values are coded as missing or unknown? Print report.
    """
    
    # initialize empty lists
    coded_unknown_or_missing_list = []
    coded_unknown_or_missing_percent_list = []
    i = 0

    # loop thu each column. Which values indicate missing and uknown?
    # how many values are coded as missing and uknown? (They would need to be replaced with NAN)
    for column in df.columns.tolist():
        
        # this is just the number of column
        print("Colum number:", i)
        i=i+1

        list_NAN_for_this_column = feat_info.missing_or_unknown[feat_info[feat_info.attribute == column].index[0]]

        print(column)
        
        print("missing' or 'unknown' code BEFORE parsing/cleaning the string")
        print(list_NAN_for_this_column)
        print(type(list_NAN_for_this_column))

        list_NAN_for_this_column = list_NAN_for_this_column.strip('[').strip(']').split(",")

        print("missing' or 'unknown' code AFTER parsing/cleaning the string")
        print(list_NAN_for_this_column)
        print(type(list_NAN_for_this_column))

        print("Sum of values in the original dataset which are coded as unknown or missing.")

        coded_unknown_or_missing = df[column].isin(list_NAN_for_this_column).sum()

        print(coded_unknown_or_missing)
        print("#############################################################################")
        print()
        
        
        # append sum of values are coded as missing and uknown to the list
        coded_unknown_or_missing_list.append(coded_unknown_or_missing)

        coded_unknown_or_missing_percent = coded_unknown_or_missing / len(df)

        coded_unknown_or_missing_percent_list.append(coded_unknown_or_missing_percent)    

    # create df
    
    coded_missing_unknown_report_df = pd.DataFrame(
                {
                 'Column': df.columns.tolist(),
                 'Coded_Uknown_or_Missing': coded_unknown_or_missing_list,
                 'Coded_Uknown_or_Missing_Percent': coded_unknown_or_missing_percent_list
                 }
                )
    
    return coded_missing_unknown_report_df  
In [32]:
azdias_coded_unknown_or_missing_report = how_many_coded_uknown_or_missing(azdias)
Colum number: 0
AGER_TYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
685843
#############################################################################

Colum number: 1
ALTERSKATEGORIE_GROB
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
2881
#############################################################################

Colum number: 2
ANREDE_KZ
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 3
CJT_GESAMTTYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 4
FINANZ_MINIMALIST
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 5
FINANZ_SPARER
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 6
FINANZ_VORSORGER
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 7
FINANZ_ANLEGER
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 8
FINANZ_UNAUFFAELLIGER
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 9
FINANZ_HAUSBAUER
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 10
FINANZTYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 11
GEBURTSJAHR
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
392318
#############################################################################

Colum number: 12
GFK_URLAUBERTYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 13
GREEN_AVANTGARDE
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 14
HEALTH_TYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
111196
#############################################################################

Colum number: 15
LP_LEBENSPHASE_FEIN
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
92778
#############################################################################

Colum number: 16
LP_LEBENSPHASE_GROB
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
89718
#############################################################################

Colum number: 17
LP_FAMILIE_FEIN
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
72938
#############################################################################

Colum number: 18
LP_FAMILIE_GROB
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
72938
#############################################################################

Colum number: 19
LP_STATUS_FEIN
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 20
LP_STATUS_GROB
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 21
NATIONALITAET_KZ
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
108315
#############################################################################

Colum number: 22
PRAEGENDE_JUGENDJAHRE
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
108164
#############################################################################

Colum number: 23
RETOURTYP_BK_S
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 24
SEMIO_SOZ
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 25
SEMIO_FAM
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 26
SEMIO_REL
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 27
SEMIO_MAT
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 28
SEMIO_VERT
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 29
SEMIO_LUST
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 30
SEMIO_ERL
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 31
SEMIO_KULT
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 32
SEMIO_RAT
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 33
SEMIO_KRIT
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 34
SEMIO_DOM
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 35
SEMIO_KAEM
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 36
SEMIO_PFLICHT
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 37
SEMIO_TRADV
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 38
SHOPPER_TYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
111196
#############################################################################

Colum number: 39
SOHO_KZ
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 40
TITEL_KZ
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
815562
#############################################################################

Colum number: 41
VERS_TYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
111196
#############################################################################

Colum number: 42
ZABEOTYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 43
ALTER_HH
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
236768
#############################################################################

Colum number: 44
ANZ_PERSONEN
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 45
ANZ_TITEL
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 46
HH_EINKOMMEN_SCORE
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 47
KK_KUNDENTYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 48
W_KEIT_KIND_HH
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
40386
#############################################################################

Colum number: 49
WOHNDAUER_2008
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 50
ANZ_HAUSHALTE_AKTIV
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
6463
#############################################################################

Colum number: 51
ANZ_HH_TITEL
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 52
GEBAEUDETYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 53
KONSUMNAEHE
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 54
MIN_GEBAEUDEJAHR
missing' or 'unknown' code BEFORE parsing/cleaning the string
[0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 55
OST_WEST_KZ
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 56
WOHNLAGE
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 57
CAMEO_DEUG_2015
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,X]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', 'X']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
373
#############################################################################

Colum number: 58
CAMEO_DEU_2015
missing' or 'unknown' code BEFORE parsing/cleaning the string
[XX]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['XX']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
373
#############################################################################

Colum number: 59
CAMEO_INTL_2015
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,XX]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', 'XX']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
373
#############################################################################

Colum number: 60
KBA05_ANTG1
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 61
KBA05_ANTG2
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 62
KBA05_ANTG3
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 63
KBA05_ANTG4
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 64
KBA05_BAUMAX
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
343200
#############################################################################

Colum number: 65
KBA05_GBZ
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 66
BALLRAUM
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 67
EWDICHTE
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 68
INNENSTADT
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 69
GEBAEUDETYP_RASTER
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 70
KKK
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
36868
#############################################################################

Colum number: 71
MOBI_REGIO
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 72
ONLINE_AFFINITAET
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 73
REGIOTYP
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
36868
#############################################################################

Colum number: 74
KBA13_ANZAHL_PKW
missing' or 'unknown' code BEFORE parsing/cleaning the string
[]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 75
PLZ8_ANTG1
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 76
PLZ8_ANTG2
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 77
PLZ8_ANTG3
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 78
PLZ8_ANTG4
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 79
PLZ8_BAUMAX
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 80
PLZ8_HHZ
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 81
PLZ8_GBZ
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
0
#############################################################################

Colum number: 82
ARBEIT
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
159
#############################################################################

Colum number: 83
ORTSGR_KLS9
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,0]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '0']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
58
#############################################################################

Colum number: 84
RELAT_AB
missing' or 'unknown' code BEFORE parsing/cleaning the string
[-1,9]
<class 'str'>
missing' or 'unknown' code AFTER parsing/cleaning the string
['-1', '9']
<class 'list'>
Sum of values in the original dataset which are coded as unknown or missing.
159
#############################################################################

In [33]:
# azdias_coded_unknown_or_missing_report.to_excel("azdias_coded_unknown_or_missing_report.xlsx")
In [34]:
azdias_coded_unknown_or_missing_report
Out[34]:
Column Coded_Uknown_or_Missing Coded_Uknown_or_Missing_Percent
0 AGER_TYP 685843 0.769554
1 ALTERSKATEGORIE_GROB 2881 0.003233
2 ANREDE_KZ 0 0.000000
3 CJT_GESAMTTYP 0 0.000000
4 FINANZ_MINIMALIST 0 0.000000
5 FINANZ_SPARER 0 0.000000
6 FINANZ_VORSORGER 0 0.000000
7 FINANZ_ANLEGER 0 0.000000
8 FINANZ_UNAUFFAELLIGER 0 0.000000
9 FINANZ_HAUSBAUER 0 0.000000
10 FINANZTYP 0 0.000000
11 GEBURTSJAHR 392318 0.440203
12 GFK_URLAUBERTYP 0 0.000000
13 GREEN_AVANTGARDE 0 0.000000
14 HEALTH_TYP 111196 0.124768
15 LP_LEBENSPHASE_FEIN 92778 0.104102
16 LP_LEBENSPHASE_GROB 89718 0.100669
17 LP_FAMILIE_FEIN 72938 0.081841
18 LP_FAMILIE_GROB 72938 0.081841
19 LP_STATUS_FEIN 0 0.000000
20 LP_STATUS_GROB 0 0.000000
21 NATIONALITAET_KZ 108315 0.121536
22 PRAEGENDE_JUGENDJAHRE 108164 0.121366
23 RETOURTYP_BK_S 0 0.000000
24 SEMIO_SOZ 0 0.000000
25 SEMIO_FAM 0 0.000000
26 SEMIO_REL 0 0.000000
27 SEMIO_MAT 0 0.000000
28 SEMIO_VERT 0 0.000000
29 SEMIO_LUST 0 0.000000
30 SEMIO_ERL 0 0.000000
31 SEMIO_KULT 0 0.000000
32 SEMIO_RAT 0 0.000000
33 SEMIO_KRIT 0 0.000000
34 SEMIO_DOM 0 0.000000
35 SEMIO_KAEM 0 0.000000
36 SEMIO_PFLICHT 0 0.000000
37 SEMIO_TRADV 0 0.000000
38 SHOPPER_TYP 111196 0.124768
39 SOHO_KZ 0 0.000000
40 TITEL_KZ 815562 0.915106
41 VERS_TYP 111196 0.124768
42 ZABEOTYP 0 0.000000
43 ALTER_HH 236768 0.265667
44 ANZ_PERSONEN 0 0.000000
45 ANZ_TITEL 0 0.000000
46 HH_EINKOMMEN_SCORE 0 0.000000
47 KK_KUNDENTYP 0 0.000000
48 W_KEIT_KIND_HH 40386 0.045315
49 WOHNDAUER_2008 0 0.000000
50 ANZ_HAUSHALTE_AKTIV 6463 0.007252
51 ANZ_HH_TITEL 0 0.000000
52 GEBAEUDETYP 0 0.000000
53 KONSUMNAEHE 0 0.000000
54 MIN_GEBAEUDEJAHR 0 0.000000
55 OST_WEST_KZ 0 0.000000
56 WOHNLAGE 0 0.000000
57 CAMEO_DEUG_2015 373 0.000419
58 CAMEO_DEU_2015 373 0.000419
59 CAMEO_INTL_2015 373 0.000419
60 KBA05_ANTG1 0 0.000000
61 KBA05_ANTG2 0 0.000000
62 KBA05_ANTG3 0 0.000000
63 KBA05_ANTG4 0 0.000000
64 KBA05_BAUMAX 343200 0.385090
65 KBA05_GBZ 0 0.000000
66 BALLRAUM 0 0.000000
67 EWDICHTE 0 0.000000
68 INNENSTADT 0 0.000000
69 GEBAEUDETYP_RASTER 0 0.000000
70 KKK 36868 0.041368
71 MOBI_REGIO 0 0.000000
72 ONLINE_AFFINITAET 0 0.000000
73 REGIOTYP 36868 0.041368
74 KBA13_ANZAHL_PKW 0 0.000000
75 PLZ8_ANTG1 0 0.000000
76 PLZ8_ANTG2 0 0.000000
77 PLZ8_ANTG3 0 0.000000
78 PLZ8_ANTG4 0 0.000000
79 PLZ8_BAUMAX 0 0.000000
80 PLZ8_HHZ 0 0.000000
81 PLZ8_GBZ 0 0.000000
82 ARBEIT 159 0.000178
83 ORTSGR_KLS9 58 0.000065
84 RELAT_AB 159 0.000178
In [35]:
"""https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
"""
fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_coded_unknown_or_missing_report['Column'], azdias_coded_unknown_or_missing_report['Coded_Uknown_or_Missing'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('Coded_Uknown_or_Missing', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Data CODED AS UNKNOWN OR MISSING in the ORIGINAL dataset', fontsize=18)

fmt = '{x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.xaxis.set_major_formatter(tick) 
ax.invert_yaxis()

plt.show()
In [36]:
def replace_coded_as_missing_unknown_with_NANs(df):
    
    """
    copy dataframe 
    
    in the copy, replace missing and uknown codes with NAN
    
    int64 gets replaced with float64, not sure why, probably cus NAN corresponds to a float
    
    https://stackoverflow.com/questions/53819909/pandas-replace-values-in-dataframe-conditionally-based-on-string-compare
    
    https://stackoverflow.com/questions/41870093/pandas-shift-converts-my-column-from-integer-to-float
    
    """
    
    # create a copy
    df_cleaned = df.copy()
    
    for column in df_cleaned.columns.tolist():
        
        # which values in this columnn are indicating missing or uknown?
        list_NAN_for_this_column = feat_info.missing_or_unknown[feat_info[feat_info.attribute == column].index[0]]
       
        # clean up the missing and unkown codes
        list_NAN_for_this_column = list_NAN_for_this_column.strip('[').strip(']').split(",")
                
        # replace with NAN. 2 methods. Link in docustring from Stack Overflow. Both work well and been tested
        
#        df_cleaned[column] = np.where(df_cleaned[column].isin(list_NAN_for_this_column), np.nan, df_cleaned[column])
        
        df_cleaned.loc[df_cleaned[column].isin(list_NAN_for_this_column), column] = np.nan
        
    return df_cleaned
In [37]:
azdias_cleaned = replace_coded_as_missing_unknown_with_NANs(azdias)  
In [38]:
def check_column(df, df_cleaned, column):
    """check if i replaced the right values with NAN"""
    print("delete these values:", feat_info.missing_or_unknown[feat_info[feat_info.attribute == column].index[0]])
    print()
    print(df[column].value_counts())
    print()
    print(df_cleaned[column].value_counts())
In [39]:
check_column(azdias, azdias_cleaned, 'AGER_TYP')
delete these values: [-1,0]

-1    677503
 2     98472
 1     79802
 3     27104
 0      8340
Name: AGER_TYP, dtype: int64

2.0    98472
1.0    79802
3.0    27104
Name: AGER_TYP, dtype: int64
In [40]:
for column in ['ALTERSKATEGORIE_GROB', 'ORTSGR_KLS9', 'RELAT_AB', 'ANREDE_KZ']:
    check_column(azdias, azdias_cleaned, column)
    print('################################')
    print()
delete these values: [-1,0,9]

3    358533
4    228510
2    158410
1    142887
9      2881
Name: ALTERSKATEGORIE_GROB, dtype: int64

3.0    358533
4.0    228510
2.0    158410
1.0    142887
Name: ALTERSKATEGORIE_GROB, dtype: int64
################################

delete these values: [-1,0]

5.0    148096
4.0    114909
7.0    102866
9.0     91879
3.0     83542
6.0     75995
8.0     72709
2.0     63362
1.0     40589
0.0        58
Name: ORTSGR_KLS9, dtype: int64

5.0    148096
4.0    114909
7.0    102866
9.0     91879
3.0     83542
6.0     75995
8.0     72709
2.0     63362
1.0     40589
Name: ORTSGR_KLS9, dtype: int64
################################

delete these values: [-1,9]

3.0    274008
5.0    174964
1.0    142907
2.0    104846
4.0     97121
9.0       159
Name: RELAT_AB, dtype: int64

3.0    274008
5.0    174964
1.0    142907
2.0    104846
4.0     97121
Name: RELAT_AB, dtype: int64
################################

delete these values: [-1,0]

2    465305
1    425916
Name: ANREDE_KZ, dtype: int64

2.0    465305
1.0    425916
Name: ANREDE_KZ, dtype: int64
################################

In [41]:
# Perform an assessment of how much missing data there is in each column of the
# dataset.
azdias_cleaned_NA_report = how_many_NA(azdias_cleaned)
azdias_cleaned_NA_report
Out[41]:
Column missing_NA missing_NA_percent
0 AGER_TYP 685843 0.769554
1 ALTERSKATEGORIE_GROB 2881 0.003233
2 ANREDE_KZ 0 0.000000
3 CJT_GESAMTTYP 4854 0.005446
4 FINANZ_MINIMALIST 0 0.000000
5 FINANZ_SPARER 0 0.000000
6 FINANZ_VORSORGER 0 0.000000
7 FINANZ_ANLEGER 0 0.000000
8 FINANZ_UNAUFFAELLIGER 0 0.000000
9 FINANZ_HAUSBAUER 0 0.000000
10 FINANZTYP 0 0.000000
11 GEBURTSJAHR 392318 0.440203
12 GFK_URLAUBERTYP 4854 0.005446
13 GREEN_AVANTGARDE 0 0.000000
14 HEALTH_TYP 111196 0.124768
15 LP_LEBENSPHASE_FEIN 97632 0.109549
16 LP_LEBENSPHASE_GROB 94572 0.106115
17 LP_FAMILIE_FEIN 77792 0.087287
18 LP_FAMILIE_GROB 77792 0.087287
19 LP_STATUS_FEIN 4854 0.005446
20 LP_STATUS_GROB 4854 0.005446
21 NATIONALITAET_KZ 108315 0.121536
22 PRAEGENDE_JUGENDJAHRE 108164 0.121366
23 RETOURTYP_BK_S 4854 0.005446
24 SEMIO_SOZ 0 0.000000
25 SEMIO_FAM 0 0.000000
26 SEMIO_REL 0 0.000000
27 SEMIO_MAT 0 0.000000
28 SEMIO_VERT 0 0.000000
29 SEMIO_LUST 0 0.000000
30 SEMIO_ERL 0 0.000000
31 SEMIO_KULT 0 0.000000
32 SEMIO_RAT 0 0.000000
33 SEMIO_KRIT 0 0.000000
34 SEMIO_DOM 0 0.000000
35 SEMIO_KAEM 0 0.000000
36 SEMIO_PFLICHT 0 0.000000
37 SEMIO_TRADV 0 0.000000
38 SHOPPER_TYP 111196 0.124768
39 SOHO_KZ 73499 0.082470
40 TITEL_KZ 889061 0.997576
41 VERS_TYP 111196 0.124768
42 ZABEOTYP 0 0.000000
43 ALTER_HH 310267 0.348137
44 ANZ_PERSONEN 73499 0.082470
45 ANZ_TITEL 73499 0.082470
46 HH_EINKOMMEN_SCORE 18348 0.020587
47 KK_KUNDENTYP 584612 0.655967
48 W_KEIT_KIND_HH 147988 0.166051
49 WOHNDAUER_2008 73499 0.082470
50 ANZ_HAUSHALTE_AKTIV 99611 0.111769
51 ANZ_HH_TITEL 97008 0.108848
52 GEBAEUDETYP 93148 0.104517
53 KONSUMNAEHE 73969 0.082997
54 MIN_GEBAEUDEJAHR 93148 0.104517
55 OST_WEST_KZ 93148 0.104517
56 WOHNLAGE 93148 0.104517
57 CAMEO_DEUG_2015 99352 0.111479
58 CAMEO_DEU_2015 99352 0.111479
59 CAMEO_INTL_2015 99352 0.111479
60 KBA05_ANTG1 133324 0.149597
61 KBA05_ANTG2 133324 0.149597
62 KBA05_ANTG3 133324 0.149597
63 KBA05_ANTG4 133324 0.149597
64 KBA05_BAUMAX 476524 0.534687
65 KBA05_GBZ 133324 0.149597
66 BALLRAUM 93740 0.105182
67 EWDICHTE 93740 0.105182
68 INNENSTADT 93740 0.105182
69 GEBAEUDETYP_RASTER 93155 0.104525
70 KKK 158064 0.177357
71 MOBI_REGIO 133324 0.149597
72 ONLINE_AFFINITAET 4854 0.005446
73 REGIOTYP 158064 0.177357
74 KBA13_ANZAHL_PKW 105800 0.118714
75 PLZ8_ANTG1 116515 0.130736
76 PLZ8_ANTG2 116515 0.130736
77 PLZ8_ANTG3 116515 0.130736
78 PLZ8_ANTG4 116515 0.130736
79 PLZ8_BAUMAX 116515 0.130736
80 PLZ8_HHZ 116515 0.130736
81 PLZ8_GBZ 116515 0.130736
82 ARBEIT 97375 0.109260
83 ORTSGR_KLS9 97274 0.109147
84 RELAT_AB 97375 0.109260
In [42]:
# azdias_cleaned_NA_report.to_excel("azdias_cleaned_NA_report.xlsx")
In [43]:
# Investigate patterns in the amount of missing data in each column.

"""https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
"""
fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_cleaned_NA_report['Column'], azdias_cleaned_NA_report['missing_NA'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('MISSING/NAN data in the dataset AFTER replacing missing/uknown codes with NAN', fontsize=18)

fmt = '{x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.xaxis.set_major_formatter(tick) 
ax.invert_yaxis()

plt.show()
In [44]:
# Investigate patterns in the amount of missing data in each column.

"""

red bars indicate columns with more than 33% missing data

https://pythonspot.com/matplotlib-bar-chart/

https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib

https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn

https://stackoverflow.com/questions/22341271/get-list-from-pandas-dataframe-column

https://stackoverflow.com/questions/3832809/how-to-change-the-color-of-a-single-bar-if-condition-is-true-matplotlib

"""

##############################################################################

# in a visualization below, outliers will be marked in red. 
# Outlier is a column wiht more than 33% missing data
condition = azdias_cleaned_NA_report['missing_NA_percent']>0.33
 
azdias_cleaned_NA_report['colour'] = np.where(condition==True, 'red', 'blue')

##############################################################################


fig, ax = plt.subplots(1, 1,figsize=(15, 30))

clrs = list(azdias_cleaned_NA_report['colour'])
clrs

plt.barh(azdias_cleaned_NA_report['Column'], azdias_cleaned_NA_report['missing_NA_percent'], align='center', alpha=0.5, color = clrs)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN_percent', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('MISSING/NAN data in the dataset after replacing missing/unknown codes with NAN - PERCENT', fontsize=18)

ax.invert_yaxis()

plt.show()
In [45]:
# Investigate patterns in the amount of missing data in each column.
#  the following columns have more than 1/3 of data missing
outliers_df = azdias_cleaned_NA_report[azdias_cleaned_NA_report['missing_NA_percent']>0.333]
outliers_df
Out[45]:
Column missing_NA missing_NA_percent colour
0 AGER_TYP 685843 0.769554 red
11 GEBURTSJAHR 392318 0.440203 red
40 TITEL_KZ 889061 0.997576 red
43 ALTER_HH 310267 0.348137 red
47 KK_KUNDENTYP 584612 0.655967 red
64 KBA05_BAUMAX 476524 0.534687 red
In [46]:
# Remove the outlier columns from the dataset. (You'll perform other data
# engineering tasks such as re-encoding and imputation later.)

print("BEFORE removing outlier colums, shape: ", azdias_cleaned.shape)

for outlier_column in outliers_df['Column'].tolist():
    del(azdias_cleaned[outlier_column])
    
print("AFTER removing outlier colums, shape: ", azdias_cleaned.shape)   
BEFORE removing outlier colums, shape:  (891221, 85)
AFTER removing outlier colums, shape:  (891221, 79)

Removed 6 columns with too much missing data

In [47]:
azdias_cleaned_NA_report = how_many_NA(azdias_cleaned)
azdias_cleaned_NA_report
Out[47]:
Column missing_NA missing_NA_percent
0 ALTERSKATEGORIE_GROB 2881 0.003233
1 ANREDE_KZ 0 0.000000
2 CJT_GESAMTTYP 4854 0.005446
3 FINANZ_MINIMALIST 0 0.000000
4 FINANZ_SPARER 0 0.000000
5 FINANZ_VORSORGER 0 0.000000
6 FINANZ_ANLEGER 0 0.000000
7 FINANZ_UNAUFFAELLIGER 0 0.000000
8 FINANZ_HAUSBAUER 0 0.000000
9 FINANZTYP 0 0.000000
10 GFK_URLAUBERTYP 4854 0.005446
11 GREEN_AVANTGARDE 0 0.000000
12 HEALTH_TYP 111196 0.124768
13 LP_LEBENSPHASE_FEIN 97632 0.109549
14 LP_LEBENSPHASE_GROB 94572 0.106115
15 LP_FAMILIE_FEIN 77792 0.087287
16 LP_FAMILIE_GROB 77792 0.087287
17 LP_STATUS_FEIN 4854 0.005446
18 LP_STATUS_GROB 4854 0.005446
19 NATIONALITAET_KZ 108315 0.121536
20 PRAEGENDE_JUGENDJAHRE 108164 0.121366
21 RETOURTYP_BK_S 4854 0.005446
22 SEMIO_SOZ 0 0.000000
23 SEMIO_FAM 0 0.000000
24 SEMIO_REL 0 0.000000
25 SEMIO_MAT 0 0.000000
26 SEMIO_VERT 0 0.000000
27 SEMIO_LUST 0 0.000000
28 SEMIO_ERL 0 0.000000
29 SEMIO_KULT 0 0.000000
30 SEMIO_RAT 0 0.000000
31 SEMIO_KRIT 0 0.000000
32 SEMIO_DOM 0 0.000000
33 SEMIO_KAEM 0 0.000000
34 SEMIO_PFLICHT 0 0.000000
35 SEMIO_TRADV 0 0.000000
36 SHOPPER_TYP 111196 0.124768
37 SOHO_KZ 73499 0.082470
38 VERS_TYP 111196 0.124768
39 ZABEOTYP 0 0.000000
40 ANZ_PERSONEN 73499 0.082470
41 ANZ_TITEL 73499 0.082470
42 HH_EINKOMMEN_SCORE 18348 0.020587
43 W_KEIT_KIND_HH 147988 0.166051
44 WOHNDAUER_2008 73499 0.082470
45 ANZ_HAUSHALTE_AKTIV 99611 0.111769
46 ANZ_HH_TITEL 97008 0.108848
47 GEBAEUDETYP 93148 0.104517
48 KONSUMNAEHE 73969 0.082997
49 MIN_GEBAEUDEJAHR 93148 0.104517
50 OST_WEST_KZ 93148 0.104517
51 WOHNLAGE 93148 0.104517
52 CAMEO_DEUG_2015 99352 0.111479
53 CAMEO_DEU_2015 99352 0.111479
54 CAMEO_INTL_2015 99352 0.111479
55 KBA05_ANTG1 133324 0.149597
56 KBA05_ANTG2 133324 0.149597
57 KBA05_ANTG3 133324 0.149597
58 KBA05_ANTG4 133324 0.149597
59 KBA05_GBZ 133324 0.149597
60 BALLRAUM 93740 0.105182
61 EWDICHTE 93740 0.105182
62 INNENSTADT 93740 0.105182
63 GEBAEUDETYP_RASTER 93155 0.104525
64 KKK 158064 0.177357
65 MOBI_REGIO 133324 0.149597
66 ONLINE_AFFINITAET 4854 0.005446
67 REGIOTYP 158064 0.177357
68 KBA13_ANZAHL_PKW 105800 0.118714
69 PLZ8_ANTG1 116515 0.130736
70 PLZ8_ANTG2 116515 0.130736
71 PLZ8_ANTG3 116515 0.130736
72 PLZ8_ANTG4 116515 0.130736
73 PLZ8_BAUMAX 116515 0.130736
74 PLZ8_HHZ 116515 0.130736
75 PLZ8_GBZ 116515 0.130736
76 ARBEIT 97375 0.109260
77 ORTSGR_KLS9 97274 0.109147
78 RELAT_AB 97375 0.109260
In [48]:
"""https://pythonspot.com/matplotlib-bar-chart/

https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib

https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn

"""

fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_cleaned_NA_report['Column'], azdias_cleaned_NA_report['missing_NA_percent'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN_percent', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('MISSING/NAN data in the dataset - deleted NAN outlier columns', fontsize=18)

ax.invert_yaxis()

plt.show()

Discussion 1.1.2: Assess Missing Data in Each Column

(Double click this cell and replace this text with your own text, reporting your observations regarding the amount of missing data in each column. Are there any patterns in missing values? Which columns were removed from the dataset?)

The following columns have been removed. They all have more than 33% missing data

  • AGER_TYP
  • GEBURTSJAHR
  • TITEL_KZ
  • ALTER_HH
  • KK_KUNDENTYP
  • KBA05_BAUMAX

3 of them are related to age: AGER_TYP, GEBURTSJAHR and ALTER_HH.

KK_KUNDERTYP has the largest amount of natually missing (NA) data (66%).

AGE_TYP and TITEL_KUZ have the largest amount of data coded as missing or unknown.

An odd thing is that, once you remove the outlier columns, if you sort all variables alphabetically, then variables with missing values are grouped together in 3 groups. You can see this above in the chart MISSING/NAN data in the dataset - deleted NAN outlier columns. This could be just happenstance. It's like we have 3 groups or clusters. This is just a metaphor/comparison I use here, I know it's not a cluster in the statistical sense of the word.

image.png

Step 1.1.3: Assess Missing Data in Each Row

Now, you'll perform a similar assessment for the rows of the dataset. How much data is missing in each row? As with the columns, you should see some groups of points that have a very different numbers of missing values. Divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.

In order to know what to do with the outlier rows, we should see if the distribution of data values on columns that are not missing data (or are missing very little data) are similar or different between the two groups. Select at least five of these columns and compare the distribution of values.

  • You can use seaborn's countplot() function to create a bar chart of code frequencies and matplotlib's subplot() function to put bar charts for the two subplots side by side.
  • To reduce repeated code, you might want to write a function that can perform this comparison, taking as one of its arguments a column to be compared.

Depending on what you observe in your comparison, this will have implications on how you approach your conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should make a note on those data as special. We'll revisit these data later on. Either way, you should continue your analysis for now using just the subset of the data with few or no missing values.

In [49]:
def count_NAN_in_each_ROW(df):
    """
    Count number of missing values in each row.
    Add a column NAN_count.
    https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe"""
    # number of columns MINUS smth like Excel COUNTA function (should count both text and numbers)
    df['NAN_count'] = df.shape[1] - df.apply(lambda x: x.count(), axis=1)
    
# test the function above
test = azdias_cleaned.sample(n=100, random_state=42)

count_NAN_in_each_ROW(test)

# test.to_excel("test.xlsx")
In [50]:
# test
In [51]:
# TESTING THE FUNCTION ABOVE - DONT USE FOR ANALYSIS
# sns.set(style="darkgrid")
# fig, ax = plt.subplots(1, 1,figsize=(20, 10))
# sns.countplot(x="NAN_count", data=test, color = 'blue', alpha=0.5)

# plt.ylabel('Frequency', fontsize=18)
# plt.xlabel('NAN Values in Row', fontsize=18)
# plt.xticks(fontsize=15)
# plt.yticks(fontsize=15)
# plt.title('Missing Data in Rows TEST DONT USE FOR ANALYSIS', fontsize=18)
# plt.show()
In [52]:
"""https://stackoverflow.com/questions/20461165/how-to-convert-index-of-a-pandas-dataframe-into-a-column

https://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values"""

def row_wise_NA_count_summary_report(df):
    row_wise_NA_count = df['NAN_count'].value_counts().to_frame()
    
    row_wise_NA_count= row_wise_NA_count.reset_index()
    
    row_wise_NA_count.columns = ['NAN Count in Row', 'Frequency']
    row_wise_NA_count=row_wise_NA_count.sort_values(by ='NAN Count in Row')
    
    return row_wise_NA_count
In [53]:
# """TEST - DONT USE FOR ANALYSIS"""

# row_wise_NA_count_summary_report_test = row_wise_NA_count_summary_report(test)

# row_wise_NA_count_summary_report_test
In [54]:
# this takes a while to run, but not too bad, just like 3-5 mins
count_NAN_in_each_ROW(azdias_cleaned)
In [55]:
# I added a column at the very end. NAN_count
azdias_cleaned.head(50)
Out[55]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ VERS_TYP ZABEOTYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count
0 2.0 1.0 2.0 3.0 4.0 3.0 5.0 5.0 3.0 4.0 10.0 0.0 NaN 15.0 4.0 2.0 2.0 1.0 1.0 NaN NaN 5.0 2.0 6.0 7.0 5.0 1.0 5.0 3.0 3.0 4.0 7.0 6.0 6.0 5.0 3.0 NaN NaN NaN 3.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
1 1.0 2.0 5.0 1.0 5.0 2.0 5.0 4.0 5.0 1.0 10.0 0.0 3.0 21.0 6.0 5.0 3.0 2.0 1.0 1.0 14.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 3.0 1.0 2.0 5.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 4.0 8 8A 51 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0
2 3.0 2.0 3.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 10.0 1.0 3.0 3.0 1.0 1.0 1.0 3.0 2.0 1.0 15.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 2.0 0.0 1.0 5.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 1.0 5.0 1992.0 W 2.0 4 4C 24 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0
3 4.0 2.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 6.0 1.0 0.0 2.0 NaN NaN NaN NaN 9.0 4.0 1.0 8.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 1.0 0.0 1.0 3.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 1.0 4.0 1997.0 W 7.0 2 2A 12 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7
4 3.0 1.0 5.0 4.0 3.0 4.0 1.0 3.0 2.0 5.0 5.0 0.0 3.0 32.0 10.0 10.0 5.0 3.0 2.0 1.0 8.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 2.0 0.0 2.0 4.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 1.0 4.0 1992.0 W 3.0 6 6B 43 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0
5 1.0 2.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 2.0 1.0 0.0 3.0 8.0 2.0 1.0 1.0 4.0 2.0 1.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 0.0 2.0 4.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 1.0 5.0 1992.0 W 7.0 8 8C 54 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0
6 2.0 2.0 5.0 1.0 5.0 1.0 5.0 4.0 3.0 4.0 12.0 0.0 2.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 10.0 4.0 2.0 5.0 5.0 7.0 2.0 6.0 5.0 5.0 7.0 7.0 4.0 7.0 7.0 7.0 1.0 0.0 1.0 4.0 1.0 0.0 6.0 3.0 9.0 4.0 0.0 1.0 5.0 1992.0 W 5.0 4 4A 22 3.0 2.0 0.0 0.0 3.0 6.0 4.0 3.0 5.0 3.0 5.0 2.0 5.0 867.0 3.0 3.0 1.0 0.0 1.0 5.0 5.0 4.0 6.0 3.0 0
7 1.0 1.0 3.0 3.0 3.0 4.0 1.0 3.0 2.0 5.0 9.0 0.0 1.0 5.0 2.0 1.0 1.0 1.0 1.0 1.0 8.0 5.0 7.0 7.0 7.0 5.0 6.0 2.0 2.0 7.0 5.0 1.0 1.0 2.0 5.0 5.0 0.0 0.0 1.0 1.0 1.0 0.0 4.0 5.0 9.0 6.0 0.0 8.0 3.0 1992.0 W 1.0 2 2D 14 2.0 2.0 0.0 0.0 4.0 2.0 5.0 3.0 4.0 1.0 4.0 1.0 1.0 758.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 2.0 5.0 2.0 0
8 3.0 1.0 3.0 4.0 4.0 2.0 4.0 2.0 2.0 6.0 3.0 1.0 3.0 10.0 3.0 1.0 1.0 10.0 5.0 1.0 11.0 4.0 4.0 5.0 4.0 1.0 5.0 6.0 4.0 5.0 2.0 5.0 5.0 3.0 1.0 4.0 3.0 0.0 2.0 6.0 1.0 0.0 3.0 5.0 8.0 2.0 1.0 3.0 4.0 1992.0 W 1.0 1 1A 13 1.0 1.0 0.0 0.0 5.0 3.0 4.0 4.0 4.0 1.0 3.0 2.0 3.0 511.0 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 4.0 3.0 0
9 3.0 2.0 4.0 2.0 4.0 2.0 3.0 5.0 4.0 1.0 12.0 1.0 2.0 4.0 1.0 1.0 1.0 3.0 2.0 1.0 15.0 4.0 2.0 1.0 1.0 3.0 2.0 6.0 6.0 3.0 4.0 7.0 6.0 7.0 1.0 3.0 3.0 0.0 2.0 4.0 1.0 0.0 4.0 4.0 3.0 9.0 0.0 3.0 4.0 1992.0 W 7.0 1 1E 15 1.0 3.0 1.0 0.0 2.0 6.0 5.0 4.0 3.0 1.0 3.0 3.0 1.0 530.0 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 3.0 1.0 0
10 3.0 2.0 1.0 2.0 2.0 5.0 3.0 1.0 5.0 6.0 8.0 0.0 2.0 6.0 2.0 1.0 1.0 1.0 1.0 1.0 3.0 5.0 5.0 2.0 3.0 5.0 5.0 6.0 6.0 1.0 4.0 7.0 4.0 7.0 5.0 3.0 2.0 0.0 2.0 3.0 1.0 0.0 6.0 6.0 9.0 6.0 0.0 1.0 4.0 1992.0 W 5.0 9 9D 51 0.0 4.0 1.0 0.0 2.0 6.0 4.0 4.0 5.0 4.0 1.0 1.0 7.0 424.0 2.0 4.0 2.0 0.0 2.0 3.0 3.0 4.0 6.0 5.0 0
11 2.0 1.0 6.0 3.0 4.0 3.0 5.0 5.0 3.0 4.0 5.0 0.0 NaN NaN NaN NaN NaN 5.0 2.0 NaN NaN 3.0 2.0 6.0 7.0 5.0 1.0 5.0 3.0 3.0 4.0 7.0 6.0 6.0 5.0 3.0 NaN NaN NaN 3.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
12 3.0 1.0 6.0 5.0 3.0 4.0 2.0 4.0 1.0 3.0 10.0 0.0 1.0 23.0 6.0 5.0 3.0 9.0 4.0 1.0 8.0 5.0 3.0 4.0 4.0 6.0 7.0 6.0 4.0 5.0 5.0 5.0 2.0 4.0 4.0 2.0 1.0 0.0 1.0 1.0 2.0 0.0 4.0 6.0 4.0 1.0 0.0 1.0 5.0 2005.0 W 3.0 6 6B 43 NaN NaN NaN NaN NaN 2.0 5.0 4.0 4.0 3.0 NaN 4.0 7.0 1106.0 3.0 3.0 1.0 0.0 1.0 5.0 5.0 3.0 6.0 4.0 6
13 1.0 2.0 5.0 1.0 4.0 3.0 5.0 5.0 2.0 1.0 12.0 1.0 3.0 3.0 1.0 1.0 1.0 5.0 2.0 1.0 15.0 1.0 2.0 4.0 5.0 4.0 1.0 2.0 4.0 3.0 6.0 7.0 6.0 4.0 5.0 6.0 3.0 0.0 2.0 5.0 1.0 0.0 3.0 6.0 3.0 2.0 0.0 1.0 5.0 2009.0 W 5.0 5 5C 33 NaN NaN NaN NaN NaN 7.0 2.0 8.0 4.0 NaN NaN 5.0 NaN 486.0 2.0 1.0 1.0 1.0 1.0 3.0 3.0 3.0 6.0 4.0 8
14 3.0 1.0 6.0 3.0 4.0 3.0 5.0 5.0 3.0 4.0 5.0 0.0 NaN NaN NaN NaN NaN 5.0 2.0 NaN NaN 3.0 2.0 6.0 7.0 5.0 1.0 5.0 3.0 3.0 4.0 7.0 6.0 6.0 5.0 3.0 NaN NaN NaN 3.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
15 4.0 2.0 4.0 4.0 1.0 5.0 1.0 1.0 4.0 2.0 12.0 0.0 2.0 12.0 3.0 1.0 1.0 8.0 4.0 1.0 5.0 5.0 5.0 2.0 2.0 5.0 2.0 7.0 7.0 1.0 3.0 6.0 5.0 6.0 3.0 3.0 1.0 0.0 2.0 3.0 1.0 0.0 5.0 6.0 4.0 1.0 0.0 8.0 1.0 1992.0 W 3.0 8 8B 41 0.0 2.0 2.0 0.0 3.0 1.0 6.0 2.0 3.0 3.0 2.0 2.0 7.0 NaN NaN NaN NaN NaN NaN NaN NaN 4.0 8.0 5.0 8
16 1.0 2.0 1.0 4.0 3.0 1.0 4.0 5.0 1.0 3.0 10.0 0.0 3.0 NaN NaN NaN NaN 8.0 4.0 3.0 14.0 1.0 1.0 4.0 5.0 4.0 2.0 2.0 4.0 5.0 7.0 4.0 4.0 4.0 6.0 6.0 2.0 0.0 2.0 5.0 0.0 0.0 5.0 NaN 3.0 NaN 0.0 2.0 5.0 1994.0 W 7.0 7 7A 41 3.0 3.0 0.0 0.0 3.0 6.0 2.0 6.0 4.0 3.0 3.0 3.0 5.0 876.0 3.0 3.0 1.0 0.0 1.0 3.0 4.0 1.0 2.0 1.0 6
17 2.0 1.0 6.0 3.0 4.0 3.0 5.0 5.0 3.0 4.0 5.0 0.0 NaN NaN NaN NaN NaN 5.0 2.0 NaN NaN 3.0 2.0 6.0 7.0 5.0 1.0 5.0 3.0 3.0 4.0 7.0 6.0 6.0 5.0 3.0 NaN NaN NaN 3.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
18 2.0 2.0 6.0 2.0 4.0 1.0 5.0 4.0 1.0 1.0 11.0 0.0 2.0 10.0 3.0 1.0 1.0 8.0 4.0 2.0 10.0 4.0 1.0 5.0 4.0 7.0 4.0 2.0 5.0 4.0 4.0 4.0 7.0 5.0 4.0 7.0 1.0 0.0 1.0 4.0 1.0 0.0 4.0 NaN 5.0 2.0 0.0 1.0 2.0 1996.0 W 2.0 4 4C 24 2.0 1.0 0.0 0.0 5.0 6.0 3.0 8.0 4.0 NaN 5.0 3.0 NaN 387.0 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 4.0 3.0 3
19 3.0 1.0 3.0 5.0 2.0 3.0 1.0 3.0 1.0 5.0 8.0 1.0 3.0 20.0 5.0 2.0 2.0 10.0 5.0 1.0 9.0 5.0 3.0 5.0 4.0 4.0 7.0 6.0 4.0 6.0 2.0 5.0 5.0 3.0 3.0 4.0 2.0 0.0 2.0 2.0 2.0 0.0 1.0 5.0 6.0 1.0 0.0 8.0 1.0 1992.0 W 1.0 5 5D 34 1.0 3.0 0.0 0.0 3.0 6.0 6.0 6.0 3.0 1.0 3.0 1.0 3.0 621.0 2.0 4.0 2.0 1.0 2.0 5.0 4.0 4.0 6.0 3.0 0
20 2.0 2.0 4.0 4.0 3.0 1.0 4.0 5.0 1.0 3.0 11.0 1.0 3.0 10.0 3.0 1.0 1.0 10.0 5.0 1.0 11.0 5.0 2.0 5.0 4.0 3.0 2.0 2.0 5.0 3.0 6.0 7.0 6.0 5.0 5.0 7.0 1.0 0.0 1.0 3.0 1.0 0.0 2.0 3.0 4.0 1.0 0.0 3.0 5.0 2002.0 W 5.0 4 4C 24 1.0 0.0 0.0 0.0 5.0 5.0 2.0 6.0 1.0 NaN 5.0 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 3.0 4.0 1.0 10
21 2.0 1.0 3.0 3.0 4.0 1.0 2.0 5.0 1.0 3.0 11.0 0.0 2.0 10.0 3.0 1.0 1.0 10.0 5.0 1.0 10.0 3.0 7.0 7.0 6.0 7.0 7.0 2.0 2.0 7.0 4.0 2.0 2.0 2.0 7.0 5.0 1.0 0.0 2.0 4.0 1.0 0.0 5.0 5.0 9.0 2.0 0.0 1.0 1.0 1992.0 W 4.0 9 9E 55 1.0 3.0 0.0 0.0 3.0 2.0 6.0 3.0 3.0 4.0 3.0 2.0 6.0 694.0 1.0 4.0 3.0 2.0 4.0 4.0 3.0 5.0 7.0 5.0 0
22 1.0 1.0 4.0 1.0 5.0 3.0 5.0 5.0 5.0 1.0 4.0 0.0 2.0 1.0 1.0 1.0 1.0 2.0 1.0 1.0 14.0 1.0 7.0 7.0 7.0 5.0 6.0 3.0 2.0 7.0 4.0 2.0 2.0 2.0 7.0 5.0 1.0 0.0 2.0 4.0 1.0 0.0 6.0 6.0 4.0 14.0 0.0 3.0 1.0 1992.0 W 4.0 9 9D 51 1.0 3.0 0.0 0.0 3.0 3.0 4.0 5.0 4.0 4.0 2.0 3.0 7.0 537.0 2.0 4.0 2.0 1.0 2.0 4.0 3.0 4.0 5.0 5.0 0
23 3.0 1.0 3.0 5.0 3.0 3.0 2.0 2.0 1.0 6.0 12.0 1.0 3.0 20.0 5.0 2.0 2.0 10.0 5.0 2.0 9.0 5.0 4.0 5.0 4.0 1.0 5.0 1.0 4.0 6.0 2.0 5.0 5.0 3.0 5.0 4.0 0.0 0.0 2.0 4.0 2.0 0.0 4.0 3.0 2.0 1.0 0.0 1.0 3.0 1992.0 W 3.0 6 6B 43 4.0 1.0 0.0 0.0 4.0 2.0 5.0 5.0 4.0 2.0 3.0 4.0 6.0 NaN NaN NaN NaN NaN NaN NaN NaN 3.0 6.0 2.0 8
24 3.0 2.0 6.0 3.0 4.0 3.0 5.0 5.0 3.0 4.0 5.0 0.0 NaN NaN NaN NaN NaN 5.0 2.0 NaN NaN 3.0 2.0 6.0 7.0 5.0 1.0 5.0 3.0 3.0 4.0 7.0 6.0 6.0 5.0 3.0 NaN NaN NaN 3.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
25 1.0 1.0 3.0 3.0 5.0 3.0 5.0 4.0 3.0 4.0 1.0 0.0 2.0 NaN NaN NaN NaN 8.0 4.0 1.0 14.0 1.0 7.0 7.0 7.0 5.0 7.0 2.0 1.0 7.0 4.0 2.0 1.0 2.0 7.0 5.0 0.0 0.0 2.0 5.0 0.0 0.0 5.0 NaN 3.0 1.0 0.0 1.0 5.0 1992.0 W 7.0 5 5C 33 1.0 2.0 0.0 0.0 5.0 7.0 1.0 8.0 4.0 3.0 4.0 3.0 6.0 1500.0 2.0 2.0 0.0 0.0 1.0 4.0 5.0 3.0 2.0 5.0 5
26 3.0 1.0 3.0 5.0 2.0 4.0 2.0 3.0 1.0 3.0 2.0 0.0 1.0 11.0 3.0 1.0 1.0 8.0 4.0 1.0 8.0 3.0 3.0 6.0 4.0 6.0 7.0 1.0 4.0 5.0 5.0 5.0 2.0 3.0 4.0 4.0 1.0 0.0 1.0 1.0 1.0 0.0 2.0 3.0 3.0 1.0 0.0 1.0 5.0 2015.0 W 7.0 NaN NaN NaN NaN NaN NaN NaN NaN 7.0 3.0 7.0 5.0 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0 3.0 5.0 19
27 3.0 1.0 4.0 3.0 3.0 4.0 1.0 2.0 2.0 5.0 8.0 1.0 3.0 25.0 7.0 7.0 4.0 3.0 2.0 1.0 11.0 1.0 3.0 5.0 4.0 1.0 5.0 6.0 4.0 6.0 2.0 5.0 5.0 3.0 3.0 4.0 0.0 0.0 2.0 4.0 2.0 0.0 4.0 1.0 3.0 6.0 0.0 1.0 2.0 1994.0 W 3.0 9 9B 51 2.0 1.0 0.0 0.0 4.0 1.0 6.0 2.0 3.0 2.0 4.0 5.0 3.0 711.0 2.0 3.0 3.0 2.0 4.0 5.0 3.0 4.0 7.0 5.0 0
28 3.0 1.0 2.0 3.0 2.0 4.0 3.0 3.0 2.0 3.0 1.0 0.0 3.0 5.0 2.0 1.0 1.0 1.0 1.0 1.0 10.0 4.0 6.0 6.0 4.0 4.0 7.0 6.0 4.0 5.0 2.0 5.0 5.0 3.0 3.0 4.0 1.0 0.0 2.0 6.0 1.0 0.0 6.0 5.0 8.0 3.0 0.0 1.0 5.0 1992.0 W 5.0 1 1A 13 1.0 2.0 0.0 0.0 3.0 4.0 2.0 5.0 4.0 4.0 5.0 0.0 6.0 1300.0 3.0 3.0 1.0 0.0 1.0 5.0 5.0 3.0 4.0 1.0 0
29 4.0 2.0 1.0 5.0 1.0 5.0 2.0 1.0 3.0 6.0 5.0 1.0 2.0 13.0 3.0 1.0 1.0 10.0 5.0 1.0 6.0 2.0 5.0 1.0 2.0 3.0 4.0 4.0 7.0 1.0 3.0 4.0 5.0 6.0 4.0 4.0 1.0 0.0 2.0 1.0 1.0 0.0 2.0 6.0 8.0 1.0 0.0 1.0 4.0 1994.0 W 3.0 2 2A 12 4.0 1.0 0.0 0.0 5.0 5.0 4.0 5.0 5.0 NaN 5.0 2.0 NaN 545.0 3.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 4.0 3.0 2
30 3.0 2.0 3.0 4.0 3.0 4.0 4.0 4.0 1.0 3.0 11.0 0.0 2.0 7.0 2.0 1.0 1.0 4.0 2.0 1.0 8.0 2.0 2.0 1.0 1.0 2.0 4.0 1.0 6.0 1.0 4.0 7.0 6.0 7.0 4.0 3.0 3.0 0.0 2.0 3.0 1.0 0.0 NaN NaN 9.0 NaN NaN NaN 5.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 34
31 2.0 2.0 6.0 1.0 5.0 2.0 3.0 5.0 4.0 1.0 1.0 0.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 10.0 1.0 4.0 2.0 5.0 7.0 3.0 1.0 5.0 5.0 7.0 4.0 7.0 7.0 6.0 7.0 1.0 0.0 1.0 4.0 1.0 0.0 6.0 6.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 2.0 9 9D 51 0.0 3.0 3.0 0.0 2.0 1.0 6.0 1.0 3.0 3.0 1.0 2.0 7.0 348.0 1.0 3.0 3.0 2.0 5.0 4.0 2.0 4.0 8.0 5.0 0
32 1.0 1.0 4.0 5.0 4.0 2.0 3.0 4.0 1.0 3.0 10.0 1.0 1.0 18.0 5.0 2.0 2.0 10.0 5.0 1.0 15.0 1.0 7.0 7.0 7.0 5.0 6.0 3.0 2.0 7.0 1.0 1.0 1.0 1.0 7.0 5.0 0.0 0.0 1.0 1.0 2.0 0.0 1.0 6.0 9.0 1.0 0.0 1.0 2.0 1992.0 W 7.0 1 1B 14 4.0 1.0 0.0 0.0 4.0 6.0 3.0 5.0 4.0 2.0 3.0 3.0 3.0 452.0 2.0 3.0 1.0 0.0 1.0 3.0 4.0 4.0 3.0 5.0 0
33 2.0 2.0 4.0 2.0 2.0 4.0 1.0 3.0 4.0 5.0 5.0 0.0 2.0 31.0 10.0 11.0 5.0 1.0 1.0 2.0 8.0 3.0 2.0 5.0 4.0 7.0 2.0 2.0 5.0 2.0 6.0 5.0 7.0 7.0 5.0 7.0 1.0 0.0 2.0 3.0 4.0 0.0 4.0 4.0 9.0 3.0 0.0 8.0 1.0 1992.0 W 1.0 8 8B 41 1.0 3.0 0.0 0.0 3.0 1.0 6.0 2.0 3.0 1.0 2.0 2.0 1.0 439.0 1.0 3.0 2.0 1.0 5.0 5.0 3.0 3.0 8.0 5.0 0
34 1.0 2.0 2.0 4.0 1.0 5.0 1.0 2.0 3.0 2.0 5.0 1.0 NaN 13.0 3.0 1.0 1.0 10.0 5.0 NaN 4.0 5.0 2.0 4.0 7.0 4.0 1.0 7.0 7.0 3.0 6.0 6.0 5.0 6.0 6.0 3.0 NaN 0.0 NaN 1.0 1.0 0.0 2.0 6.0 5.0 1.0 0.0 1.0 3.0 1992.0 W 3.0 3 3D 25 3.0 0.0 0.0 0.0 4.0 1.0 5.0 2.0 5.0 2.0 3.0 2.0 5.0 758.0 2.0 3.0 2.0 1.0 1.0 3.0 3.0 4.0 7.0 5.0 4
35 2.0 2.0 2.0 3.0 4.0 3.0 5.0 5.0 3.0 4.0 7.0 0.0 NaN 19.0 5.0 2.0 2.0 9.0 4.0 NaN NaN 3.0 2.0 6.0 7.0 5.0 1.0 5.0 3.0 3.0 4.0 7.0 6.0 6.0 5.0 3.0 NaN NaN NaN 3.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
36 3.0 2.0 2.0 2.0 3.0 4.0 3.0 2.0 3.0 6.0 5.0 0.0 2.0 5.0 2.0 1.0 1.0 1.0 1.0 1.0 8.0 5.0 5.0 2.0 3.0 2.0 2.0 1.0 6.0 1.0 4.0 7.0 7.0 7.0 5.0 2.0 2.0 0.0 2.0 3.0 1.0 0.0 5.0 3.0 9.0 13.0 0.0 1.0 2.0 1992.0 W 3.0 7 7A 41 0.0 0.0 3.0 0.0 3.0 1.0 5.0 4.0 3.0 3.0 2.0 0.0 5.0 442.0 2.0 3.0 2.0 1.0 1.0 3.0 3.0 4.0 9.0 4.0 0
37 4.0 1.0 2.0 5.0 1.0 5.0 1.0 1.0 3.0 6.0 5.0 0.0 2.0 38.0 12.0 10.0 5.0 9.0 4.0 1.0 8.0 5.0 3.0 6.0 3.0 2.0 6.0 7.0 7.0 4.0 3.0 3.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 1.0 3.0 0.0 3.0 2.0 9.0 1.0 0.0 1.0 3.0 1992.0 W 4.0 4 4C 24 3.0 1.0 0.0 0.0 4.0 6.0 2.0 6.0 5.0 3.0 4.0 2.0 5.0 469.0 4.0 2.0 1.0 0.0 1.0 3.0 4.0 3.0 5.0 2.0 0
38 2.0 2.0 6.0 3.0 3.0 1.0 4.0 4.0 2.0 4.0 1.0 0.0 2.0 10.0 3.0 1.0 1.0 9.0 4.0 2.0 10.0 1.0 2.0 5.0 4.0 7.0 2.0 2.0 5.0 5.0 7.0 7.0 7.0 5.0 7.0 7.0 1.0 0.0 1.0 4.0 1.0 0.0 5.0 3.0 9.0 2.0 0.0 1.0 3.0 1992.0 W 3.0 3 3D 25 4.0 2.0 0.0 0.0 4.0 6.0 3.0 8.0 5.0 4.0 3.0 2.0 4.0 442.0 4.0 2.0 0.0 0.0 1.0 3.0 4.0 3.0 5.0 2.0 0
39 4.0 1.0 4.0 5.0 3.0 4.0 3.0 1.0 1.0 6.0 10.0 1.0 2.0 35.0 11.0 11.0 5.0 10.0 5.0 1.0 11.0 2.0 6.0 6.0 3.0 6.0 6.0 5.0 3.0 6.0 3.0 1.0 3.0 2.0 3.0 1.0 0.0 0.0 1.0 1.0 3.0 0.0 2.0 1.0 9.0 1.0 0.0 1.0 5.0 1996.0 W 2.0 NaN NaN NaN 3.0 1.0 0.0 0.0 4.0 6.0 3.0 6.0 4.0 NaN 4.0 5.0 NaN 670.0 2.0 3.0 2.0 1.0 1.0 4.0 4.0 3.0 5.0 3.0 5
40 4.0 2.0 1.0 3.0 2.0 5.0 2.0 1.0 5.0 6.0 4.0 0.0 1.0 8.0 2.0 1.0 1.0 4.0 2.0 1.0 NaN 2.0 2.0 1.0 1.0 2.0 2.0 5.0 7.0 1.0 3.0 7.0 5.0 5.0 2.0 1.0 1.0 0.0 1.0 3.0 1.0 0.0 6.0 6.0 9.0 3.0 0.0 3.0 5.0 1992.0 W 3.0 NaN NaN NaN 2.0 2.0 0.0 0.0 4.0 6.0 3.0 6.0 4.0 4.0 4.0 0.0 6.0 203.0 3.0 2.0 1.0 0.0 1.0 2.0 3.0 3.0 5.0 3.0 4
41 1.0 2.0 3.0 2.0 5.0 3.0 5.0 5.0 2.0 1.0 9.0 1.0 3.0 30.0 9.0 10.0 5.0 5.0 2.0 1.0 15.0 4.0 1.0 4.0 4.0 4.0 1.0 2.0 2.0 3.0 6.0 7.0 7.0 6.0 6.0 6.0 2.0 0.0 2.0 2.0 5.0 0.0 2.0 2.0 9.0 1.0 0.0 1.0 3.0 1992.0 W 2.0 4 4E 25 4.0 1.0 0.0 0.0 4.0 2.0 4.0 4.0 3.0 2.0 5.0 5.0 2.0 635.0 3.0 2.0 1.0 0.0 1.0 4.0 5.0 3.0 5.0 2.0 0
42 1.0 2.0 5.0 3.0 4.0 2.0 5.0 4.0 2.0 3.0 10.0 0.0 3.0 10.0 3.0 1.0 1.0 8.0 4.0 1.0 14.0 1.0 3.0 4.0 4.0 4.0 4.0 2.0 4.0 4.0 4.0 7.0 4.0 4.0 4.0 6.0 0.0 0.0 2.0 1.0 1.0 0.0 3.0 6.0 3.0 2.0 0.0 1.0 5.0 1994.0 W 7.0 4 4B 23 3.0 1.0 0.0 0.0 4.0 6.0 2.0 5.0 5.0 NaN 5.0 5.0 NaN 309.0 4.0 3.0 1.0 0.0 1.0 2.0 3.0 3.0 3.0 1.0 2
43 1.0 2.0 4.0 1.0 5.0 3.0 5.0 5.0 5.0 1.0 10.0 0.0 3.0 1.0 1.0 1.0 1.0 2.0 1.0 1.0 14.0 1.0 1.0 4.0 4.0 4.0 1.0 3.0 2.0 3.0 6.0 6.0 7.0 6.0 6.0 6.0 2.0 0.0 1.0 5.0 1.0 0.0 6.0 6.0 4.0 4.0 0.0 1.0 3.0 1992.0 W 2.0 4 4E 25 1.0 2.0 1.0 0.0 3.0 6.0 5.0 3.0 3.0 2.0 3.0 5.0 3.0 463.0 2.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 7.0 2.0 0
44 3.0 2.0 4.0 3.0 2.0 4.0 4.0 2.0 2.0 6.0 11.0 0.0 3.0 7.0 2.0 1.0 1.0 4.0 2.0 1.0 8.0 2.0 2.0 1.0 3.0 2.0 2.0 4.0 6.0 3.0 4.0 4.0 4.0 5.0 4.0 2.0 3.0 0.0 1.0 3.0 1.0 0.0 5.0 4.0 6.0 4.0 0.0 3.0 5.0 1992.0 W 7.0 3 3C 24 1.0 2.0 0.0 0.0 4.0 6.0 2.0 6.0 3.0 3.0 3.0 3.0 6.0 352.0 3.0 2.0 1.0 0.0 1.0 2.0 3.0 1.0 1.0 1.0 0
45 4.0 2.0 2.0 5.0 1.0 5.0 2.0 1.0 2.0 2.0 7.0 0.0 2.0 38.0 12.0 11.0 5.0 9.0 4.0 1.0 5.0 3.0 2.0 1.0 2.0 1.0 3.0 7.0 7.0 2.0 3.0 4.0 4.0 5.0 2.0 3.0 3.0 0.0 1.0 4.0 5.0 0.0 4.0 1.0 9.0 1.0 0.0 1.0 5.0 1996.0 W 7.0 7 7A 41 2.0 1.0 0.0 0.0 5.0 7.0 2.0 8.0 5.0 3.0 5.0 3.0 5.0 1300.0 3.0 2.0 1.0 0.0 1.0 5.0 5.0 3.0 2.0 3.0 0
46 NaN 2.0 3.0 2.0 4.0 3.0 5.0 5.0 4.0 1.0 1.0 0.0 NaN NaN NaN 1.0 1.0 2.0 1.0 1.0 8.0 5.0 2.0 1.0 5.0 4.0 4.0 2.0 7.0 3.0 6.0 7.0 6.0 7.0 4.0 1.0 NaN 0.0 NaN 3.0 1.0 0.0 NaN NaN 9.0 NaN NaN NaN 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40
47 4.0 2.0 2.0 4.0 1.0 5.0 1.0 2.0 3.0 5.0 12.0 1.0 2.0 6.0 2.0 1.0 1.0 1.0 1.0 1.0 4.0 5.0 2.0 1.0 1.0 3.0 4.0 7.0 7.0 2.0 2.0 6.0 5.0 5.0 4.0 3.0 3.0 0.0 1.0 3.0 1.0 0.0 3.0 6.0 9.0 2.0 0.0 1.0 4.0 1992.0 W 2.0 3 3D 25 3.0 2.0 0.0 0.0 3.0 3.0 2.0 5.0 4.0 2.0 4.0 1.0 2.0 524.0 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 5.0 3.0 0
48 3.0 2.0 6.0 3.0 4.0 3.0 5.0 5.0 3.0 4.0 5.0 0.0 NaN NaN NaN NaN NaN 5.0 2.0 NaN NaN 3.0 2.0 6.0 7.0 5.0 1.0 5.0 3.0 3.0 4.0 7.0 6.0 6.0 5.0 3.0 NaN NaN NaN 3.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
49 2.0 1.0 6.0 4.0 5.0 1.0 4.0 4.0 2.0 4.0 12.0 0.0 3.0 10.0 3.0 1.0 1.0 9.0 4.0 2.0 10.0 1.0 7.0 7.0 6.0 7.0 6.0 6.0 3.0 7.0 4.0 5.0 4.0 1.0 6.0 5.0 1.0 0.0 1.0 1.0 1.0 0.0 6.0 4.0 9.0 1.0 0.0 8.0 4.0 1992.0 W 3.0 6 6B 43 0.0 3.0 3.0 0.0 2.0 3.0 3.0 5.0 4.0 2.0 2.0 2.0 2.0 891.0 3.0 2.0 1.0 0.0 1.0 4.0 5.0 3.0 4.0 2.0 0
In [56]:
# checking my NAN_count in each row
azdias_cleaned.isnull().sum(axis=1)
Out[56]:
0         43
1          0
2          0
3          7
4          0
5          0
6          0
7          0
8          0
9          0
10         0
11        47
12         6
13         8
14        47
15         8
16         6
17        47
18         3
19         0
20        10
21         0
22         0
23         8
24        47
25         5
26        19
27         0
28         0
29         2
30        34
31         0
32         0
33         0
34         4
35        43
36         0
37         0
38         0
39         5
40         4
41         0
42         2
43         0
44         0
45         0
46        40
47         0
48        47
49         0
50         0
51         0
52         0
53        43
54        47
55         0
56         0
57         0
58         0
59         0
60         4
61        43
62        43
63         0
64         5
65         0
66         0
67         0
68         0
69        47
70         7
71         0
72         1
73         8
74         0
75        47
76        43
77         0
78         0
79         0
80         2
81        45
82         0
83        43
84         0
85         0
86         0
87         0
88         8
89         4
90        34
91         0
92         3
93         0
94         3
95         0
96         5
97        47
98         0
99        47
          ..
891121     0
891122     0
891123     0
891124     0
891125     0
891126     0
891127     0
891128     5
891129     8
891130    34
891131     0
891132     0
891133     0
891134     0
891135     6
891136     2
891137    34
891138     0
891139     7
891140    34
891141     0
891142     0
891143     0
891144     0
891145     0
891146     0
891147     0
891148     0
891149     0
891150     3
891151     0
891152     0
891153     0
891154    11
891155     4
891156     0
891157     4
891158     0
891159    43
891160     0
891161     3
891162     6
891163     0
891164    43
891165     0
891166     0
891167     0
891168     0
891169    18
891170    19
891171    34
891172    16
891173    19
891174     0
891175    47
891176     0
891177     0
891178     0
891179     0
891180     0
891181     3
891182     0
891183     0
891184     8
891185    47
891186     0
891187    47
891188     0
891189     0
891190     0
891191     0
891192     0
891193     0
891194     0
891195     0
891196     0
891197     0
891198     0
891199     0
891200     0
891201     0
891202     0
891203    14
891204     0
891205     0
891206     0
891207     0
891208     3
891209     0
891210     0
891211     0
891212     0
891213     0
891214     0
891215     0
891216     3
891217     4
891218     5
891219     0
891220     0
Length: 891221, dtype: int64
In [57]:
row_wise_NA_count_summary_report_azdias_cleaned = row_wise_NA_count_summary_report(azdias_cleaned)

row_wise_NA_count_summary_report_azdias_cleaned

# this is an aggregation. How many missing values per row vs frequency
Out[57]:
NAN Count in Row Frequency
0 0 623209
7 1 15738
2 2 27926
6 3 17629
10 4 12607
5 5 22515
8 6 13771
9 7 13714
4 8 24592
16 9 3042
12 10 5410
21 11 1127
22 12 766
15 13 3255
19 14 2243
13 15 4743
18 16 2505
23 17 677
26 18 428
20 19 1180
29 20 349
32 21 150
35 22 129
34 23 132
38 24 69
40 25 55
39 26 59
41 27 24
45 28 5
43 29 12
44 30 6
46 31 3
30 32 206
17 33 2985
11 34 10816
14 35 3911
36 36 84
24 37 538
27 38 421
37 39 77
33 40 137
28 41 356
42 42 21
3 43 27369
31 44 155
25 45 494
1 47 45578
48 48 1
47 49 2
In [58]:
len(azdias_cleaned) -623209
Out[58]:
268012
In [59]:
print(sum(azdias_cleaned.isnull().any(axis=1)))
268012
In [60]:
# 268012 rows have missing values
In [61]:
# 30% of rows have missing values - this calculation is manual and hardcoded
(len(azdias_cleaned) -623209) / len(azdias_cleaned)
Out[61]:
0.3007245116531141
In [62]:
print(azdias_cleaned['NAN_count'].describe())
count    891221.000000
mean          5.649894
std          13.234687
min           0.000000
25%           0.000000
50%           0.000000
75%           3.000000
max          49.000000
Name: NAN_count, dtype: float64
In [63]:
sns.set(style="darkgrid")
fig, ax = plt.subplots(1, 1,figsize=(20, 10))
sns.countplot(x="NAN_count", data=azdias_cleaned, color = 'blue', alpha=0.5)

plt.ylabel('Frequency', fontsize=18)
plt.xlabel('NAN Values in Row', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Missing Data in Rows', fontsize=18)
plt.show()
In [64]:
fig, ax = plt.subplots(1, 1,figsize=(20, 10))

sns.countplot(x="NAN_count", data=azdias_cleaned[azdias_cleaned['NAN_count']>0], color = 'blue', alpha=0.5)

plt.ylabel('Frequency', fontsize=18)
plt.xlabel('NAN Values in Row', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Missing Data in Rows - Only Rows with Missing Data', fontsize=18)
plt.show()

This is just my screenshot, where I can draw things on

image.png

Notice how there are 3 "clusters" of data depending on how many values with missing data there are in each row. We can remove data with missing rows, depending on our threshold. Possible thresholds are:

29-30ish

40-42ish

9-10 ish is also possible, to tolerate very few missing values

Below I will explore what's going to happen with different thresholds: how much data will be dropped.

In [65]:
azdias_cleaned.head()
Out[65]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ VERS_TYP ZABEOTYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count
0 2.0 1.0 2.0 3.0 4.0 3.0 5.0 5.0 3.0 4.0 10.0 0.0 NaN 15.0 4.0 2.0 2.0 1.0 1.0 NaN NaN 5.0 2.0 6.0 7.0 5.0 1.0 5.0 3.0 3.0 4.0 7.0 6.0 6.0 5.0 3.0 NaN NaN NaN 3.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
1 1.0 2.0 5.0 1.0 5.0 2.0 5.0 4.0 5.0 1.0 10.0 0.0 3.0 21.0 6.0 5.0 3.0 2.0 1.0 1.0 14.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 3.0 1.0 2.0 5.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 4.0 8 8A 51 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0
2 3.0 2.0 3.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 10.0 1.0 3.0 3.0 1.0 1.0 1.0 3.0 2.0 1.0 15.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 2.0 0.0 1.0 5.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 1.0 5.0 1992.0 W 2.0 4 4C 24 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0
3 4.0 2.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 6.0 1.0 0.0 2.0 NaN NaN NaN NaN 9.0 4.0 1.0 8.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 1.0 0.0 1.0 3.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 1.0 4.0 1997.0 W 7.0 2 2A 12 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7
4 3.0 1.0 5.0 4.0 3.0 4.0 1.0 3.0 2.0 5.0 5.0 0.0 3.0 32.0 10.0 10.0 5.0 3.0 2.0 1.0 8.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 2.0 0.0 2.0 4.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 1.0 4.0 1992.0 W 3.0 6 6B 43 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0
In [66]:
azdias_cleaned['NAN_count'].describe()
Out[66]:
count    891221.000000
mean          5.649894
std          13.234687
min           0.000000
25%           0.000000
50%           0.000000
75%           3.000000
max          49.000000
Name: NAN_count, dtype: float64
In [67]:
# Write code to divide the data into two subsets based on the number of missing
# values in each row.

def divide_into_subsets(df_cleaned, threshold):
    
    """divide data into 2 subsets depending on how many missing values there in row
    provide threshold
    """
    condition = df_cleaned['NAN_count'] > threshold

    df_cleaned['many_missing_values_in_row'] = np.where(condition==True, 1, 0)

    print("Threshold of missing rows in the row: ", threshold)
    print()

    
    print('Total nrow:', len(df_cleaned['many_missing_values_in_row']))
    print()

    df_cleaned_missing_few = df_cleaned[df_cleaned['many_missing_values_in_row'] ==0]
    df_cleaned_missing_many = df_cleaned[df_cleaned['many_missing_values_in_row'] ==1]

    print("Few Missing Values. Will be kept.")
    print(len(df_cleaned_missing_few['NAN_count']))
    print()
    print("Lots of Missing Values. Will be deleted")
    print(len(df_cleaned_missing_many['NAN_count']))
    print()
    print("% of rows with lots of missing values. % of data deleted ") 
    print(df_cleaned['many_missing_values_in_row'].mean())
    
    return df_cleaned_missing_few, df_cleaned_missing_many, threshold
In [68]:
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold =   divide_into_subsets(azdias_cleaned, 30)
Threshold of missing rows in the row:  30

Total nrow: 891221

Few Missing Values. Will be kept.
798067

Lots of Missing Values. Will be deleted
93154

% of rows with lots of missing values. % of data deleted 
0.104524018173
In [69]:
# threshold: 9
# keep: 774743
# drop: 116478 (0.130694855709)

# threshold: 29
# keep: 798061
# drop: 93160 (0.104530750509694)

# threshold: 30
# keep: 798067
# drop: 93154 (0.104524018173)

# threshold:  42
# keep: 817622
# drop: 73599 (0.08258221024863642)

Let's use 30 as a threshold for how many mising values we can tolerate in our rows

In [70]:
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold =   divide_into_subsets(azdias_cleaned, 30)
Threshold of missing rows in the row:  30

Total nrow: 891221

Few Missing Values. Will be kept.
798067

Lots of Missing Values. Will be deleted
93154

% of rows with lots of missing values. % of data deleted 
0.104524018173
In [71]:
azdias_cleaned.head()
Out[71]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ VERS_TYP ZABEOTYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row
0 2.0 1.0 2.0 3.0 4.0 3.0 5.0 5.0 3.0 4.0 10.0 0.0 NaN 15.0 4.0 2.0 2.0 1.0 1.0 NaN NaN 5.0 2.0 6.0 7.0 5.0 1.0 5.0 3.0 3.0 4.0 7.0 6.0 6.0 5.0 3.0 NaN NaN NaN 3.0 NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 43 1
1 1.0 2.0 5.0 1.0 5.0 2.0 5.0 4.0 5.0 1.0 10.0 0.0 3.0 21.0 6.0 5.0 3.0 2.0 1.0 1.0 14.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 3.0 1.0 2.0 5.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 4.0 8 8A 51 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0
2 3.0 2.0 3.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 10.0 1.0 3.0 3.0 1.0 1.0 1.0 3.0 2.0 1.0 15.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 2.0 0.0 1.0 5.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 1.0 5.0 1992.0 W 2.0 4 4C 24 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 0
3 4.0 2.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 6.0 1.0 0.0 2.0 NaN NaN NaN NaN 9.0 4.0 1.0 8.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 1.0 0.0 1.0 3.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 1.0 4.0 1997.0 W 7.0 2 2A 12 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 0
4 3.0 1.0 5.0 4.0 3.0 4.0 1.0 3.0 2.0 5.0 5.0 0.0 3.0 32.0 10.0 10.0 5.0 3.0 2.0 1.0 8.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 2.0 0.0 2.0 4.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 1.0 4.0 1992.0 W 3.0 6 6B 43 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 0
In [72]:
azdias_cleaned_missing_few.head()
Out[72]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ VERS_TYP ZABEOTYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row
1 1.0 2.0 5.0 1.0 5.0 2.0 5.0 4.0 5.0 1.0 10.0 0.0 3.0 21.0 6.0 5.0 3.0 2.0 1.0 1.0 14.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 3.0 1.0 2.0 5.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 4.0 8 8A 51 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0
2 3.0 2.0 3.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 10.0 1.0 3.0 3.0 1.0 1.0 1.0 3.0 2.0 1.0 15.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 2.0 0.0 1.0 5.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 1.0 5.0 1992.0 W 2.0 4 4C 24 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 0
3 4.0 2.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 6.0 1.0 0.0 2.0 NaN NaN NaN NaN 9.0 4.0 1.0 8.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 1.0 0.0 1.0 3.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 1.0 4.0 1997.0 W 7.0 2 2A 12 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 0
4 3.0 1.0 5.0 4.0 3.0 4.0 1.0 3.0 2.0 5.0 5.0 0.0 3.0 32.0 10.0 10.0 5.0 3.0 2.0 1.0 8.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 2.0 0.0 2.0 4.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 1.0 4.0 1992.0 W 3.0 6 6B 43 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 0
5 1.0 2.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 2.0 1.0 0.0 3.0 8.0 2.0 1.0 1.0 4.0 2.0 1.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 0.0 2.0 4.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 1.0 5.0 1992.0 W 7.0 8 8C 54 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0
In [73]:
# Investigate patterns in the amount of missing data in each column.
# Very few missing columns
no_or_very_few_NAN = azdias_cleaned_NA_report[azdias_cleaned_NA_report['missing_NA_percent']<0.025]

no_or_very_few_NAN
Out[73]:
Column missing_NA missing_NA_percent
0 ALTERSKATEGORIE_GROB 2881 0.003233
1 ANREDE_KZ 0 0.000000
2 CJT_GESAMTTYP 4854 0.005446
3 FINANZ_MINIMALIST 0 0.000000
4 FINANZ_SPARER 0 0.000000
5 FINANZ_VORSORGER 0 0.000000
6 FINANZ_ANLEGER 0 0.000000
7 FINANZ_UNAUFFAELLIGER 0 0.000000
8 FINANZ_HAUSBAUER 0 0.000000
9 FINANZTYP 0 0.000000
10 GFK_URLAUBERTYP 4854 0.005446
11 GREEN_AVANTGARDE 0 0.000000
17 LP_STATUS_FEIN 4854 0.005446
18 LP_STATUS_GROB 4854 0.005446
21 RETOURTYP_BK_S 4854 0.005446
22 SEMIO_SOZ 0 0.000000
23 SEMIO_FAM 0 0.000000
24 SEMIO_REL 0 0.000000
25 SEMIO_MAT 0 0.000000
26 SEMIO_VERT 0 0.000000
27 SEMIO_LUST 0 0.000000
28 SEMIO_ERL 0 0.000000
29 SEMIO_KULT 0 0.000000
30 SEMIO_RAT 0 0.000000
31 SEMIO_KRIT 0 0.000000
32 SEMIO_DOM 0 0.000000
33 SEMIO_KAEM 0 0.000000
34 SEMIO_PFLICHT 0 0.000000
35 SEMIO_TRADV 0 0.000000
39 ZABEOTYP 0 0.000000
42 HH_EINKOMMEN_SCORE 18348 0.020587
66 ONLINE_AFFINITAET 4854 0.005446
In [74]:
no_or_very_few_NAN['missing_NA_percent'].max()
Out[74]:
0.020587486156632306
In [75]:
# THIS RUNS on my machine but not in Udacity workspace

# AttributeError: module 'seaborn' has no attribute 'catplot'

# """https://seaborn.pydata.org/generated/
# seaborn.countplot.html"""

# i = 0
# for variable in no_or_very_few_NAN['Column'].tolist():
#     if i > 5:
#         break

#     sns.catplot(x=variable, col="many_missing_values_in_row",
#                 data=azdias_cleaned, kind="count",
#                 height=4, aspect=.7)
#     plt.show()
#     i = i+1

Dataset with few vs lost of missing values - are they the same or different? I'm trying this out with diff thresholds of row missing data

In [76]:
# Compare the distribution of values for at least five columns where there are
# no or few missing values, between the two subsets.

def pairwise_comparison(stop_after = 5):
    
    """create countplots of columns with few or no missing values 
    for two datasets: with few and with many missing values in rows
    stops after several variables
    indicates a threshold you picked above (threshold separates two subsets)
    """
    sns.set(style="white", palette="muted", color_codes=True)

    i = 0

    for variable in no_or_very_few_NAN['Column'].tolist():

        # testing and running this check on only a few variables
        if i > stop_after:
            break
    #         pass

        fig, ax = plt.subplots(1, 2,figsize=(15, 7))    

        plt.subplot(1,2,1)  

        sns.countplot(x=variable, data=azdias_cleaned_missing_few, color = 'blue', alpha=0.5)

        plt.ylabel('Frequency', fontsize=14)

        plt.xticks(fontsize=15)
        plt.yticks(fontsize=15)
        title = 'FEW missing values in rows, threshold: ' + str(threshold)
        plt.title(title,  fontsize=18)


        plt.subplot(1,2,2) 
        sns.countplot(x=variable, data=azdias_cleaned_missing_many, color = 'blue', alpha=0.5)

        plt.ylabel('Frequency', fontsize=14)

        plt.xticks(fontsize=15)
        plt.yticks(fontsize=15)
        title = 'MANY missing values in rows, threshold: ' + str(threshold)
        plt.title(title,  fontsize=18)
        plt.show()

        i = i+1
In [77]:
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold =   divide_into_subsets(azdias_cleaned, 9)
pairwise_comparison(10)
Threshold of missing rows in the row:  9

Total nrow: 891221

Few Missing Values. Will be kept.
774743

Lots of Missing Values. Will be deleted
116478

% of rows with lots of missing values. % of data deleted 
0.130694855709
In [78]:
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold =   divide_into_subsets(azdias_cleaned, 30)
pairwise_comparison(10)
Threshold of missing rows in the row:  30

Total nrow: 891221

Few Missing Values. Will be kept.
798067

Lots of Missing Values. Will be deleted
93154

% of rows with lots of missing values. % of data deleted 
0.104524018173
In [79]:
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold =   divide_into_subsets(azdias_cleaned, 42)
pairwise_comparison(10)
Threshold of missing rows in the row:  42

Total nrow: 891221

Few Missing Values. Will be kept.
817622

Lots of Missing Values. Will be deleted
73599

% of rows with lots of missing values. % of data deleted 
0.0825822102486
In [80]:
# I'm re-running this code to make sure I'm actually usingng the threshold I decided on 
# (I was playing a lot above with different thresholds)
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold =   divide_into_subsets(azdias_cleaned, 30)
Threshold of missing rows in the row:  30

Total nrow: 891221

Few Missing Values. Will be kept.
798067

Lots of Missing Values. Will be deleted
93154

% of rows with lots of missing values. % of data deleted 
0.104524018173
In [81]:
azdias_cleaned_missing_few.shape
Out[81]:
(798067, 81)

Discussion 1.1.3: Assess Missing Data in Each Row

(Double-click this cell and replace this text with your own text, reporting your observations regarding missing data in rows. Are the data with lots of missing values are qualitatively different from data with few or no missing values?)

Data with lots of missing values in rows appear to be qualitatively different from data with few or no missing values. You can see this if you compare the above distributions side by side.

There are 3 clusters of data depending on how many values with missing data there are in each row.

I we can split the dataset into two subsets, depending on how much missing data there are in each row. We can use several possible thresholds: 30 and 42.

30 would split cluster 1 from the rest of the data (~10%).

42 would split cluster 3 from the rest of the data (~8% of the data). This option was used to subset the data. 8% of data has been removed. 817622 rows out of 891221 remained.

I ended up going with threshold of 30, to minimize missing data, while still following hte logic of my analysis (the 3 cluster of dataset based on missing data).

Step 1.2: Select and Re-Encode Features

Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.

  • For numeric and interval data, these features can be kept without changes.
  • Most of the variables in the dataset are ordinal in nature. While ordinal values may technically be non-linear in spacing, make the simplifying assumption that the ordinal variables can be treated as being interval in nature (that is, kept without any changes).
  • Special handling may be necessary for the remaining two variable types: categorical, and 'mixed'.

In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.

Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!

In [82]:
feat_info.head()
Out[82]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
In [83]:
# How many features are there of each data type?
feat_info['type'].value_counts()
Out[83]:
ordinal        49
categorical    21
mixed           7
numeric         7
interval        1
Name: type, dtype: int64

Step 1.2.1: Re-Encode Categorical Features

For categorical data, you would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, perform one of the following:

  • For binary (two-level) categoricals that take numeric values, you can keep them without needing to do anything.
  • There is one binary variable that takes on non-numeric values. For this one, you need to re-encode the values as numbers or create a dummy variable.
  • For multi-level categoricals (three or more values), you can choose to encode the values using multiple dummy variables (e.g. via OneHotEncoder), or (to keep things straightforward) just drop them from the analysis. As always, document your choices in the Discussion section.
In [84]:
categorical_vars = feat_info[feat_info['type']=='categorical']

len(categorical_vars)
Out[84]:
21
In [85]:
list(categorical_vars['attribute'])
Out[85]:
['AGER_TYP',
 'ANREDE_KZ',
 'CJT_GESAMTTYP',
 'FINANZTYP',
 'GFK_URLAUBERTYP',
 'GREEN_AVANTGARDE',
 'LP_FAMILIE_FEIN',
 'LP_FAMILIE_GROB',
 'LP_STATUS_FEIN',
 'LP_STATUS_GROB',
 'NATIONALITAET_KZ',
 'SHOPPER_TYP',
 'SOHO_KZ',
 'TITEL_KZ',
 'VERS_TYP',
 'ZABEOTYP',
 'KK_KUNDENTYP',
 'GEBAEUDETYP',
 'OST_WEST_KZ',
 'CAMEO_DEUG_2015',
 'CAMEO_DEU_2015']
In [86]:
# Assess categorical variables: which are binary, which are multi-level, and
# which one needs to be re-encoded?

"""
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html
#https://stackoverflow.com/questions/36198118/np-isnan-on-arrays-of-dtype-object
"""
categorical_vars = feat_info[feat_info['type']=='categorical']

len(categorical_vars)

categorical_vars = list(categorical_vars['attribute'])

binary = []
multilevel = []

for variable in categorical_vars:
    print(variable)
    print()
    # is this categorical column removed from the cleaned dataset? Because it has too much missing info
    if variable in (list(outliers_df['Column'])):
        print("This column has been removed from azdias_cleaned_missing_few as having too much missing data.")
    else:
        
        # these levels includes nan
        unique_levels = list(azdias_cleaned_missing_few[variable].unique())
        # number of unique levels, excluding nan
        number_of_unique_levels = len(unique_levels) - sum(pd.isnull(unique_levels))
        print("Number of Unique levels: ", number_of_unique_levels)
        print()
        
        # is it multilevel of binary?
        if number_of_unique_levels > 2:
            print("Multilevel Categorical")
            multilevel.append(variable)
        else: 
            print("Binary Categorical")
            binary.append(variable)
        print()
        print("Unique Levels: ", unique_levels)        
        print()
    print("###################################################################")   
AGER_TYP

This column has been removed from azdias_cleaned_missing_few as having too much missing data.
###################################################################
ANREDE_KZ

Number of Unique levels:  2

Binary Categorical

Unique Levels:  [2.0, 1.0]

###################################################################
CJT_GESAMTTYP

Number of Unique levels:  6

Multilevel Categorical

Unique Levels:  [5.0, 3.0, 2.0, 4.0, 1.0, 6.0, nan]

###################################################################
FINANZTYP

Number of Unique levels:  6

Multilevel Categorical

Unique Levels:  [1.0, 6.0, 5.0, 2.0, 4.0, 3.0]

###################################################################
GFK_URLAUBERTYP

Number of Unique levels:  12

Multilevel Categorical

Unique Levels:  [10.0, 1.0, 5.0, 12.0, 9.0, 3.0, 8.0, 11.0, 4.0, 2.0, 7.0, 6.0, nan]

###################################################################
GREEN_AVANTGARDE

Number of Unique levels:  2

Binary Categorical

Unique Levels:  [0.0, 1.0]

###################################################################
LP_FAMILIE_FEIN

Number of Unique levels:  11

Multilevel Categorical

Unique Levels:  [5.0, 1.0, nan, 10.0, 2.0, 7.0, 11.0, 8.0, 4.0, 6.0, 9.0, 3.0]

###################################################################
LP_FAMILIE_GROB

Number of Unique levels:  5

Multilevel Categorical

Unique Levels:  [3.0, 1.0, nan, 5.0, 2.0, 4.0]

###################################################################
LP_STATUS_FEIN

Number of Unique levels:  10

Multilevel Categorical

Unique Levels:  [2.0, 3.0, 9.0, 4.0, 1.0, 10.0, 5.0, 8.0, 6.0, 7.0, nan]

###################################################################
LP_STATUS_GROB

Number of Unique levels:  5

Multilevel Categorical

Unique Levels:  [1.0, 2.0, 4.0, 5.0, 3.0, nan]

###################################################################
NATIONALITAET_KZ

Number of Unique levels:  3

Multilevel Categorical

Unique Levels:  [1.0, 3.0, 2.0, nan]

###################################################################
SHOPPER_TYP

Number of Unique levels:  4

Multilevel Categorical

Unique Levels:  [3.0, 2.0, 1.0, 0.0, nan]

###################################################################
SOHO_KZ

Number of Unique levels:  2

Binary Categorical

Unique Levels:  [1.0, 0.0]

###################################################################
TITEL_KZ

This column has been removed from azdias_cleaned_missing_few as having too much missing data.
###################################################################
VERS_TYP

Number of Unique levels:  2

Binary Categorical

Unique Levels:  [2.0, 1.0, nan]

###################################################################
ZABEOTYP

Number of Unique levels:  6

Multilevel Categorical

Unique Levels:  [5.0, 3.0, 4.0, 1.0, 6.0, 2.0]

###################################################################
KK_KUNDENTYP

This column has been removed from azdias_cleaned_missing_few as having too much missing data.
###################################################################
GEBAEUDETYP

Number of Unique levels:  7

Multilevel Categorical

Unique Levels:  [8.0, 1.0, 3.0, 2.0, 6.0, 4.0, 5.0]

###################################################################
OST_WEST_KZ

Number of Unique levels:  2

Binary Categorical

Unique Levels:  ['W', 'O']

###################################################################
CAMEO_DEUG_2015

Number of Unique levels:  9

Multilevel Categorical

Unique Levels:  ['8', '4', '2', '6', '1', '9', '5', '7', nan, '3']

###################################################################
CAMEO_DEU_2015

Number of Unique levels:  44

Multilevel Categorical

Unique Levels:  ['8A', '4C', '2A', '6B', '8C', '4A', '2D', '1A', '1E', '9D', '5C', '8B', '7A', '5D', '9E', nan, '9B', '1B', '3D', '4E', '4B', '3C', '5A', '7B', '9A', '6D', '6E', '2C', '7C', '9C', '7D', '5E', '1D', '8D', '6C', '6A', '5B', '4D', '3A', '2B', '7E', '3B', '6F', '5F', '1C']

###################################################################
In [87]:
print("Binary Categoricals: ")    
print(binary)  
print()
print("Multilevel Categoricals: ")
print(multilevel)
Binary Categoricals: 
['ANREDE_KZ', 'GREEN_AVANTGARDE', 'SOHO_KZ', 'VERS_TYP', 'OST_WEST_KZ']

Multilevel Categoricals: 
['CJT_GESAMTTYP', 'FINANZTYP', 'GFK_URLAUBERTYP', 'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB', 'LP_STATUS_FEIN', 'LP_STATUS_GROB', 'NATIONALITAET_KZ', 'SHOPPER_TYP', 'ZABEOTYP', 'GEBAEUDETYP', 'CAMEO_DEUG_2015', 'CAMEO_DEU_2015']

This one needs to be recoded:

OST_WEST_KZ

Number of Unique levels: 2

Binary

Unique Levels: [nan, 'W', 'O']

In [88]:
azdias_cleaned_missing_few.head()
Out[88]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ VERS_TYP ZABEOTYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row
1 1.0 2.0 5.0 1.0 5.0 2.0 5.0 4.0 5.0 1.0 10.0 0.0 3.0 21.0 6.0 5.0 3.0 2.0 1.0 1.0 14.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 3.0 1.0 2.0 5.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 4.0 8 8A 51 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0
2 3.0 2.0 3.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 10.0 1.0 3.0 3.0 1.0 1.0 1.0 3.0 2.0 1.0 15.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 2.0 0.0 1.0 5.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 1.0 5.0 1992.0 W 2.0 4 4C 24 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 0
3 4.0 2.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 6.0 1.0 0.0 2.0 NaN NaN NaN NaN 9.0 4.0 1.0 8.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 1.0 0.0 1.0 3.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 1.0 4.0 1997.0 W 7.0 2 2A 12 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 0
4 3.0 1.0 5.0 4.0 3.0 4.0 1.0 3.0 2.0 5.0 5.0 0.0 3.0 32.0 10.0 10.0 5.0 3.0 2.0 1.0 8.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 2.0 0.0 2.0 4.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 1.0 4.0 1992.0 W 3.0 6 6B 43 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 0
5 1.0 2.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 2.0 1.0 0.0 3.0 8.0 2.0 1.0 1.0 4.0 2.0 1.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 0.0 2.0 4.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 1.0 5.0 1992.0 W 7.0 8 8C 54 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0
In [89]:
# Re-encode categorical variable(s) to be kept in the analysis.

# print(azdias_cleaned_missing_few['OST_WEST_KZ'].head(100))

azdias_cleaned_missing_few['OST_WEST_KZ'].tail(100)
Out[89]:
891112    O
891113    O
891114    W
891115    W
891116    W
891117    W
891118    W
891119    W
891120    W
891121    W
891122    W
891123    W
891124    W
891125    W
891126    W
891127    W
891128    W
891129    W
891131    W
891132    W
891133    W
891134    W
891135    W
891136    W
891138    O
891139    O
891141    W
891142    W
891143    W
891144    W
891145    W
891146    W
891147    W
891148    W
891149    W
891150    W
891151    W
891152    W
891153    W
891154    W
891155    W
891156    W
891157    W
891158    W
891160    W
891161    W
891162    W
891163    W
891165    W
891166    W
891167    W
891168    W
891169    W
891170    W
891172    W
891173    W
891174    W
891176    W
891177    W
891178    W
891179    W
891180    W
891181    W
891182    W
891183    W
891184    W
891186    W
891188    W
891189    W
891190    W
891191    W
891192    W
891193    W
891194    W
891195    W
891196    W
891197    W
891198    W
891199    W
891200    W
891201    W
891202    W
891203    W
891204    W
891205    W
891206    W
891207    W
891208    W
891209    W
891210    W
891211    W
891212    W
891213    W
891214    W
891215    W
891216    W
891217    W
891218    W
891219    W
891220    W
Name: OST_WEST_KZ, dtype: object
In [90]:
print(azdias_cleaned.shape)
print(azdias_cleaned_missing_few.shape)
(891221, 81)
(798067, 81)
In [91]:
print(azdias_cleaned.OST_WEST_KZ.value_counts())
bar_graph_for_each_column(azdias_cleaned,column_list=['OST_WEST_KZ']) 

# this gives me the same result:
# print(azdias_cleaned_missing_few.OST_WEST_KZ.value_counts())
# bar_graph_for_each_column(azdias_cleaned_missing_few,column_list=['OST_WEST_KZ'])  
W    629528
O    168545
Name: OST_WEST_KZ, dtype: int64
OST_WEST_KZ
   counts  percentage
W  629528   78.881005
O  168545   21.118995
In [92]:
"""Deep Learning. Lesson 1 Intro to Neural Networks. 36 Notebook: Analyzing Student Data"""
# Make dummy variables for OST_WEST_KZ
azdias_cleaned_encoded = pd.concat([azdias_cleaned_missing_few, pd.get_dummies(azdias_cleaned_missing_few['OST_WEST_KZ'], prefix='OST_WEST_KZ')], axis=1)
In [93]:
azdias_cleaned_encoded.shape
Out[93]:
(798067, 83)
In [94]:
azdias_cleaned_encoded.head()
Out[94]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ VERS_TYP ZABEOTYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row OST_WEST_KZ_O OST_WEST_KZ_W
1 1.0 2.0 5.0 1.0 5.0 2.0 5.0 4.0 5.0 1.0 10.0 0.0 3.0 21.0 6.0 5.0 3.0 2.0 1.0 1.0 14.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 3.0 1.0 2.0 5.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 8.0 1.0 1992.0 W 4.0 8 8A 51 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0 0 1
2 3.0 2.0 3.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 10.0 1.0 3.0 3.0 1.0 1.0 1.0 3.0 2.0 1.0 15.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 2.0 0.0 1.0 5.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 1.0 5.0 1992.0 W 2.0 4 4C 24 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 0 0 1
3 4.0 2.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 6.0 1.0 0.0 2.0 NaN NaN NaN NaN 9.0 4.0 1.0 8.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 1.0 0.0 1.0 3.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 1.0 4.0 1997.0 W 7.0 2 2A 12 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 0 0 1
4 3.0 1.0 5.0 4.0 3.0 4.0 1.0 3.0 2.0 5.0 5.0 0.0 3.0 32.0 10.0 10.0 5.0 3.0 2.0 1.0 8.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 2.0 0.0 2.0 4.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 1.0 4.0 1992.0 W 3.0 6 6B 43 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 0 0 1
5 1.0 2.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 2.0 1.0 0.0 3.0 8.0 2.0 1.0 1.0 4.0 2.0 1.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 0.0 2.0 4.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 1.0 5.0 1992.0 W 7.0 8 8C 54 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0 0 1
In [95]:
azdias_cleaned_encoded[['OST_WEST_KZ', 'OST_WEST_KZ_O', 'OST_WEST_KZ_W']].tail(100)
Out[95]:
OST_WEST_KZ OST_WEST_KZ_O OST_WEST_KZ_W
891112 O 1 0
891113 O 1 0
891114 W 0 1
891115 W 0 1
891116 W 0 1
891117 W 0 1
891118 W 0 1
891119 W 0 1
891120 W 0 1
891121 W 0 1
891122 W 0 1
891123 W 0 1
891124 W 0 1
891125 W 0 1
891126 W 0 1
891127 W 0 1
891128 W 0 1
891129 W 0 1
891131 W 0 1
891132 W 0 1
891133 W 0 1
891134 W 0 1
891135 W 0 1
891136 W 0 1
891138 O 1 0
891139 O 1 0
891141 W 0 1
891142 W 0 1
891143 W 0 1
891144 W 0 1
891145 W 0 1
891146 W 0 1
891147 W 0 1
891148 W 0 1
891149 W 0 1
891150 W 0 1
891151 W 0 1
891152 W 0 1
891153 W 0 1
891154 W 0 1
891155 W 0 1
891156 W 0 1
891157 W 0 1
891158 W 0 1
891160 W 0 1
891161 W 0 1
891162 W 0 1
891163 W 0 1
891165 W 0 1
891166 W 0 1
891167 W 0 1
891168 W 0 1
891169 W 0 1
891170 W 0 1
891172 W 0 1
891173 W 0 1
891174 W 0 1
891176 W 0 1
891177 W 0 1
891178 W 0 1
891179 W 0 1
891180 W 0 1
891181 W 0 1
891182 W 0 1
891183 W 0 1
891184 W 0 1
891186 W 0 1
891188 W 0 1
891189 W 0 1
891190 W 0 1
891191 W 0 1
891192 W 0 1
891193 W 0 1
891194 W 0 1
891195 W 0 1
891196 W 0 1
891197 W 0 1
891198 W 0 1
891199 W 0 1
891200 W 0 1
891201 W 0 1
891202 W 0 1
891203 W 0 1
891204 W 0 1
891205 W 0 1
891206 W 0 1
891207 W 0 1
891208 W 0 1
891209 W 0 1
891210 W 0 1
891211 W 0 1
891212 W 0 1
891213 W 0 1
891214 W 0 1
891215 W 0 1
891216 W 0 1
891217 W 0 1
891218 W 0 1
891219 W 0 1
891220 W 0 1
In [96]:
pd.crosstab(azdias_cleaned_encoded.OST_WEST_KZ, azdias_cleaned_encoded.OST_WEST_KZ_O)
Out[96]:
OST_WEST_KZ_O 0 1
OST_WEST_KZ
O 0 168542
W 629525 0
In [97]:
pd.crosstab(azdias_cleaned_encoded.OST_WEST_KZ, azdias_cleaned_encoded.OST_WEST_KZ_W)
Out[97]:
OST_WEST_KZ_W 0 1
OST_WEST_KZ
O 168542 0
W 0 629525
In [98]:
pd.crosstab(azdias_cleaned_encoded.OST_WEST_KZ_O, azdias_cleaned_encoded.OST_WEST_KZ_W)
Out[98]:
OST_WEST_KZ_W 0 1
OST_WEST_KZ_O
0 0 629525
1 168542 0
In [99]:
# Drop the previous OST_WEST_KZ column
print(azdias_cleaned_encoded.shape)
# we actually don't need both dummy vars. 2nd one doesn't add anything. We can drop OST_WEST_KZ_W and keep OST_WEST_KZ_O
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['OST_WEST_KZ_W','OST_WEST_KZ'], axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 83)
(798067, 81)
In [100]:
print(len(multilevel))
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(multilevel, axis = 1)
print(azdias_cleaned_encoded.shape)
13
(798067, 81)
(798067, 68)

Discussion 1.2.1: Re-Encode Categorical Features

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding categorical features. Which ones did you keep, which did you drop, and what engineering steps did you perform?)

Categorical variables

Kept untouched

Binary Categoricals: ['ANREDE_KZ', 'GREEN_AVANTGARDE', 'SOHO_KZ', 'VERS_TYP']

Recoded as dummy variable

Binary Categorical: 'OST_WEST_KZ'

Dropped from analysis just to keep things straighforward and also avoid overfitting due to having too many independent variables:

Multilevel Categoricals: ['CJT_GESAMTTYP', 'FINANZTYP', 'GFK_URLAUBERTYP', 'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB', 'LP_STATUS_FEIN', 'LP_STATUS_GROB', 'NATIONALITAET_KZ', 'SHOPPER_TYP', 'ZABEOTYP', 'GEBAEUDETYP', 'CAMEO_DEUG_2015', 'CAMEO_DEU_2015']

When I was working on the very last part of the project, description of the customer clusters, I regretted dropping all of these vars. I should have kept a couple of them at least, indicating shopper type.

Step 1.2.2: Engineer Mixed-Type Features

There are a handful of features that are marked as "mixed" in the feature summary that require special treatment in order to be included in the analysis. There are two in particular that deserve attention; the handling of the rest are up to your own choices:

  • "PRAEGENDE_JUGENDJAHRE" combines information on three dimensions: generation by decade, movement (mainstream vs. avantgarde), and nation (east vs. west). While there aren't enough levels to disentangle east from west, you should create two new variables to capture the other two dimensions: an interval-type variable for decade, and a binary variable for movement.
  • "CAMEO_INTL_2015" combines information on two axes: wealth and life stage. Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
  • If you decide to keep or engineer new features around the other mixed-type features, make sure you note your steps in the Discussion section.

Be sure to check Data_Dictionary.md for the details needed to finish these tasks.

In [101]:
#  How many features are there of each data type?
feat_info['type'].value_counts()
Out[101]:
ordinal        49
categorical    21
mixed           7
numeric         7
interval        1
Name: type, dtype: int64
Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.
In [102]:
azdias_cleaned_encoded["PRAEGENDE_JUGENDJAHRE"].value_counts()
Out[102]:
14.0    182985
8.0     141630
10.0     85808
5.0      84692
3.0      53845
15.0     42547
11.0     35752
9.0      33570
6.0      25652
12.0     24446
1.0      20678
4.0      20451
2.0       7479
13.0      5764
7.0       4010
Name: PRAEGENDE_JUGENDJAHRE, dtype: int64
In [103]:
# for item in my_dict:
#     print(item)
#     print(my_dict[item])
In [104]:
def recode_column_according_2_my_dict(df, oldvar, newvar, my_dict):
    """recode a column according to how it's specified in my dictionary my_dict"""

    for item in my_dict:
        condition = df[oldvar].isin(my_dict[item])
        df[newvar] = np.where(condition==True, item, df[newvar])

    """
    https://stackoverflow.com/questions/33271098/python-get-a-frequency-count-based-on-two-columns-variables-in-pandas-datafra
    """
    return pd.crosstab(df[oldvar], df[newvar])  
In [105]:
#https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column

# initialize column
azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_DECADE'] = 0

# here's how the column will be recoded
my_dict = {
    40:[1,2],
    50:[3,4],
    60:[5,6,7],
    70:[8,9],
    80:[10,11,12,13],
    90:[14,15]
}
recode_column_according_2_my_dict(azdias_cleaned_encoded, "PRAEGENDE_JUGENDJAHRE", "PRAEGENDE_JUGENDJAHRE_DECADE", my_dict)  
Out[105]:
PRAEGENDE_JUGENDJAHRE_DECADE 40 50 60 70 80 90
PRAEGENDE_JUGENDJAHRE
1.0 20678 0 0 0 0 0
2.0 7479 0 0 0 0 0
3.0 0 53845 0 0 0 0
4.0 0 20451 0 0 0 0
5.0 0 0 84692 0 0 0
6.0 0 0 25652 0 0 0
7.0 0 0 4010 0 0 0
8.0 0 0 0 141630 0 0
9.0 0 0 0 33570 0 0
10.0 0 0 0 0 85808 0
11.0 0 0 0 0 35752 0
12.0 0 0 0 0 24446 0
13.0 0 0 0 0 5764 0
14.0 0 0 0 0 0 182985
15.0 0 0 0 0 0 42547
In [106]:
azdias_cleaned_encoded[["PRAEGENDE_JUGENDJAHRE","PRAEGENDE_JUGENDJAHRE_DECADE"]].head(100)
Out[106]:
PRAEGENDE_JUGENDJAHRE PRAEGENDE_JUGENDJAHRE_DECADE
1 14.0 90
2 15.0 90
3 8.0 70
4 8.0 70
5 3.0 50
6 10.0 80
7 8.0 70
8 11.0 80
9 15.0 90
10 3.0 50
12 8.0 70
13 15.0 90
15 5.0 60
16 14.0 90
18 10.0 80
19 9.0 70
20 11.0 80
21 10.0 80
22 14.0 90
23 9.0 70
25 14.0 90
26 8.0 70
27 11.0 80
28 10.0 80
29 6.0 60
31 10.0 80
32 15.0 90
33 8.0 70
34 4.0 50
36 8.0 70
37 8.0 70
38 10.0 80
39 11.0 80
40 NaN 0
41 15.0 90
42 14.0 90
43 14.0 90
44 8.0 70
45 5.0 60
47 4.0 50
49 10.0 80
50 14.0 90
51 15.0 90
52 14.0 90
55 8.0 70
56 11.0 80
57 14.0 90
58 15.0 90
59 10.0 80
60 8.0 70
63 15.0 90
64 14.0 90
65 8.0 70
66 14.0 90
67 5.0 60
68 5.0 60
70 11.0 80
71 8.0 70
72 NaN 0
73 10.0 80
74 11.0 80
77 8.0 70
78 10.0 80
79 15.0 90
80 10.0 80
82 14.0 90
84 8.0 70
85 5.0 60
86 3.0 50
87 11.0 80
88 10.0 80
89 14.0 90
91 15.0 90
92 14.0 90
93 5.0 60
94 5.0 60
95 14.0 90
96 10.0 80
98 15.0 90
100 2.0 40
101 8.0 70
102 5.0 60
104 10.0 80
105 11.0 80
107 10.0 80
110 10.0 80
111 8.0 70
113 4.0 50
114 5.0 60
115 6.0 60
116 10.0 80
117 15.0 90
118 8.0 70
119 1.0 40
120 5.0 60
121 6.0 60
122 14.0 90
123 11.0 80
124 9.0 70
125 5.0 60
In [107]:
azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'] = ""

my_dict ={
    "AVANTGARDE": [2, 4, 6, 7, 9, 11, 13, 15],
    "MAINSTREAM": [1, 3, 5, 8, 10, 12, 14]    
}

recode_column_according_2_my_dict(azdias_cleaned_encoded, "PRAEGENDE_JUGENDJAHRE", "PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP", my_dict)
Out[107]:
PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP AVANTGARDE MAINSTREAM
PRAEGENDE_JUGENDJAHRE
1.0 0 20678
2.0 7479 0
3.0 0 53845
4.0 20451 0
5.0 0 84692
6.0 25652 0
7.0 4010 0
8.0 0 141630
9.0 33570 0
10.0 0 85808
11.0 35752 0
12.0 0 24446
13.0 5764 0
14.0 0 182985
15.0 42547 0
In [108]:
azdias_cleaned_encoded[["PRAEGENDE_JUGENDJAHRE","PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP"]].head(100)
Out[108]:
PRAEGENDE_JUGENDJAHRE PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP
1 14.0 MAINSTREAM
2 15.0 AVANTGARDE
3 8.0 MAINSTREAM
4 8.0 MAINSTREAM
5 3.0 MAINSTREAM
6 10.0 MAINSTREAM
7 8.0 MAINSTREAM
8 11.0 AVANTGARDE
9 15.0 AVANTGARDE
10 3.0 MAINSTREAM
12 8.0 MAINSTREAM
13 15.0 AVANTGARDE
15 5.0 MAINSTREAM
16 14.0 MAINSTREAM
18 10.0 MAINSTREAM
19 9.0 AVANTGARDE
20 11.0 AVANTGARDE
21 10.0 MAINSTREAM
22 14.0 MAINSTREAM
23 9.0 AVANTGARDE
25 14.0 MAINSTREAM
26 8.0 MAINSTREAM
27 11.0 AVANTGARDE
28 10.0 MAINSTREAM
29 6.0 AVANTGARDE
31 10.0 MAINSTREAM
32 15.0 AVANTGARDE
33 8.0 MAINSTREAM
34 4.0 AVANTGARDE
36 8.0 MAINSTREAM
37 8.0 MAINSTREAM
38 10.0 MAINSTREAM
39 11.0 AVANTGARDE
40 NaN
41 15.0 AVANTGARDE
42 14.0 MAINSTREAM
43 14.0 MAINSTREAM
44 8.0 MAINSTREAM
45 5.0 MAINSTREAM
47 4.0 AVANTGARDE
49 10.0 MAINSTREAM
50 14.0 MAINSTREAM
51 15.0 AVANTGARDE
52 14.0 MAINSTREAM
55 8.0 MAINSTREAM
56 11.0 AVANTGARDE
57 14.0 MAINSTREAM
58 15.0 AVANTGARDE
59 10.0 MAINSTREAM
60 8.0 MAINSTREAM
63 15.0 AVANTGARDE
64 14.0 MAINSTREAM
65 8.0 MAINSTREAM
66 14.0 MAINSTREAM
67 5.0 MAINSTREAM
68 5.0 MAINSTREAM
70 11.0 AVANTGARDE
71 8.0 MAINSTREAM
72 NaN
73 10.0 MAINSTREAM
74 11.0 AVANTGARDE
77 8.0 MAINSTREAM
78 10.0 MAINSTREAM
79 15.0 AVANTGARDE
80 10.0 MAINSTREAM
82 14.0 MAINSTREAM
84 8.0 MAINSTREAM
85 5.0 MAINSTREAM
86 3.0 MAINSTREAM
87 11.0 AVANTGARDE
88 10.0 MAINSTREAM
89 14.0 MAINSTREAM
91 15.0 AVANTGARDE
92 14.0 MAINSTREAM
93 5.0 MAINSTREAM
94 5.0 MAINSTREAM
95 14.0 MAINSTREAM
96 10.0 MAINSTREAM
98 15.0 AVANTGARDE
100 2.0 AVANTGARDE
101 8.0 MAINSTREAM
102 5.0 MAINSTREAM
104 10.0 MAINSTREAM
105 11.0 AVANTGARDE
107 10.0 MAINSTREAM
110 10.0 MAINSTREAM
111 8.0 MAINSTREAM
113 4.0 AVANTGARDE
114 5.0 MAINSTREAM
115 6.0 AVANTGARDE
116 10.0 MAINSTREAM
117 15.0 AVANTGARDE
118 8.0 MAINSTREAM
119 1.0 MAINSTREAM
120 5.0 MAINSTREAM
121 6.0 AVANTGARDE
122 14.0 MAINSTREAM
123 11.0 AVANTGARDE
124 9.0 AVANTGARDE
125 5.0 MAINSTREAM
In [109]:
"""Deep Learning. Lesson 1 Intro to Neural Networks. 36 Notebook: Analyzing Student Data"""
# Make dummy variables for PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP
azdias_cleaned_encoded = pd.concat([azdias_cleaned_encoded, pd.get_dummies(azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], prefix='PRAEGENDE_JUGENDJAHRE_MOVEMENT')], axis=1)
In [110]:
pd.crosstab(azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE'])
Out[110]:
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE 0 1
PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP
28758 0
AVANTGARDE 0 175225
MAINSTREAM 594084 0
In [111]:
pd.crosstab(azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM'])
Out[111]:
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM 0 1
PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP
28758 0
AVANTGARDE 175225 0
MAINSTREAM 0 594084
In [112]:
pd.crosstab(azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE'], azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM'])
Out[112]:
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM 0 1
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE
0 28758 594084
1 175225 0
In [113]:
azdias_cleaned_encoded["PRAEGENDE_JUGENDJAHRE"].isna().sum()
Out[113]:
28758
In [114]:
# Drop the previous PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP column. 
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 73)
(798067, 72)
In [115]:
# Drop the previous PRAEGENDE_JUGENDJAHRE
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop('PRAEGENDE_JUGENDJAHRE', axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 72)
(798067, 71)
In [116]:
azdias_cleaned_encoded.head()
Out[116]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR WOHNLAGE CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_ PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 21.0 6.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 4.0 51 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0 0 90 0 0 1
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 1.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 2.0 24 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 0 0 90 0 1 0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 NaN NaN 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 7.0 12 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 0 0 70 0 0 1
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 32.0 10.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 3.0 43 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 0 0 70 0 0 1
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 8.0 2.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 7.0 54 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0 0 50 0 0 1
In [117]:
azdias_cleaned_encoded.PRAEGENDE_JUGENDJAHRE_MOVEMENT_.describe()
Out[117]:
count    798067.000000
mean          0.036035
std           0.186376
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: PRAEGENDE_JUGENDJAHRE_MOVEMENT_, dtype: float64
In [118]:
azdias_cleaned_encoded.PRAEGENDE_JUGENDJAHRE_MOVEMENT_.sum()
Out[118]:
28758
In [119]:
# Drop the previous PRAEGENDE_JUGENDJAHRE_MOVEMENT_. This column just indicated sum(NAN) in PRAEGENDE_JUGENDJAHRE
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop('PRAEGENDE_JUGENDJAHRE_MOVEMENT_', axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 71)
(798067, 70)
In [120]:
azdias_cleaned_encoded.head()
Out[120]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR WOHNLAGE CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 21.0 6.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 4.0 51 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0 0 90 0 1
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 1.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 2.0 24 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 0 0 90 1 0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 NaN NaN 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 7.0 12 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 0 0 70 0 1
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 32.0 10.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 3.0 43 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 0 0 70 0 1
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 8.0 2.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 7.0 54 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0 0 50 0 1
Investigate "CAMEO_INTL_2015" and engineer two new variables.
In [121]:
azdias_cleaned_encoded["CAMEO_INTL_2015"].value_counts()
Out[121]:
51    133694
41     92336
24     91158
14     62884
43     56672
54     45391
25     39627
22     33154
23     26750
13     26335
45     26132
55     23955
52     20542
31     19024
34     18524
15     16974
44     14820
12     13249
35     10356
32     10354
33      9935
Name: CAMEO_INTL_2015, dtype: int64
In [122]:
azdias_cleaned_encoded.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 798067 entries, 1 to 891220
Data columns (total 70 columns):
ALTERSKATEGORIE_GROB                         795264 non-null float64
ANREDE_KZ                                    798067 non-null float64
FINANZ_MINIMALIST                            798067 non-null float64
FINANZ_SPARER                                798067 non-null float64
FINANZ_VORSORGER                             798067 non-null float64
FINANZ_ANLEGER                               798067 non-null float64
FINANZ_UNAUFFAELLIGER                        798067 non-null float64
FINANZ_HAUSBAUER                             798067 non-null float64
GREEN_AVANTGARDE                             798067 non-null float64
HEALTH_TYP                                   761341 non-null float64
LP_LEBENSPHASE_FEIN                          747828 non-null float64
LP_LEBENSPHASE_GROB                          750721 non-null float64
RETOURTYP_BK_S                               793318 non-null float64
SEMIO_SOZ                                    798067 non-null float64
SEMIO_FAM                                    798067 non-null float64
SEMIO_REL                                    798067 non-null float64
SEMIO_MAT                                    798067 non-null float64
SEMIO_VERT                                   798067 non-null float64
SEMIO_LUST                                   798067 non-null float64
SEMIO_ERL                                    798067 non-null float64
SEMIO_KULT                                   798067 non-null float64
SEMIO_RAT                                    798067 non-null float64
SEMIO_KRIT                                   798067 non-null float64
SEMIO_DOM                                    798067 non-null float64
SEMIO_KAEM                                   798067 non-null float64
SEMIO_PFLICHT                                798067 non-null float64
SEMIO_TRADV                                  798067 non-null float64
SOHO_KZ                                      798067 non-null float64
VERS_TYP                                     761341 non-null float64
ANZ_PERSONEN                                 798067 non-null float64
ANZ_TITEL                                    798067 non-null float64
HH_EINKOMMEN_SCORE                           798067 non-null float64
W_KEIT_KIND_HH                               738762 non-null float64
WOHNDAUER_2008                               798067 non-null float64
ANZ_HAUSHALTE_AKTIV                          791605 non-null float64
ANZ_HH_TITEL                                 794208 non-null float64
KONSUMNAEHE                                  797995 non-null float64
MIN_GEBAEUDEJAHR                             798067 non-null float64
WOHNLAGE                                     798067 non-null float64
CAMEO_INTL_2015                              791866 non-null object
KBA05_ANTG1                                  757897 non-null float64
KBA05_ANTG2                                  757897 non-null float64
KBA05_ANTG3                                  757897 non-null float64
KBA05_ANTG4                                  757897 non-null float64
KBA05_GBZ                                    757897 non-null float64
BALLRAUM                                     797475 non-null float64
EWDICHTE                                     797475 non-null float64
INNENSTADT                                   797475 non-null float64
GEBAEUDETYP_RASTER                           798060 non-null float64
KKK                                          733157 non-null float64
MOBI_REGIO                                   757897 non-null float64
ONLINE_AFFINITAET                            793318 non-null float64
REGIOTYP                                     733157 non-null float64
KBA13_ANZAHL_PKW                             785420 non-null float64
PLZ8_ANTG1                                   774706 non-null float64
PLZ8_ANTG2                                   774706 non-null float64
PLZ8_ANTG3                                   774706 non-null float64
PLZ8_ANTG4                                   774706 non-null float64
PLZ8_BAUMAX                                  774706 non-null float64
PLZ8_HHZ                                     774706 non-null float64
PLZ8_GBZ                                     774706 non-null float64
ARBEIT                                       793840 non-null float64
ORTSGR_KLS9                                  793941 non-null float64
RELAT_AB                                     793840 non-null float64
NAN_count                                    798067 non-null int64
many_missing_values_in_row                   798067 non-null int64
OST_WEST_KZ_O                                798067 non-null uint8
PRAEGENDE_JUGENDJAHRE_DECADE                 798067 non-null int64
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE    798067 non-null uint8
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM    798067 non-null uint8
dtypes: float64(63), int64(3), object(1), uint8(3)
memory usage: 416.3+ MB
In [123]:
"""
https://stackoverflow.com/questions/41271299/how-can-i-get-the-first-two-digits-of-a-number
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html
"""
azdias_cleaned_encoded["CAMEO_INTL_2015_WEALTH"] =azdias_cleaned_encoded["CAMEO_INTL_2015"].str.slice(0,1)

azdias_cleaned_encoded["CAMEO_INTL_2015_LIFE_STAGE_TYP"] =azdias_cleaned_encoded["CAMEO_INTL_2015"].str.slice(1,2)
In [124]:
azdias_cleaned_encoded[["CAMEO_INTL_2015", "CAMEO_INTL_2015_WEALTH","CAMEO_INTL_2015_LIFE_STAGE_TYP"]].head(10)
Out[124]:
CAMEO_INTL_2015 CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP
1 51 5 1
2 24 2 4
3 12 1 2
4 43 4 3
5 54 5 4
6 22 2 2
7 14 1 4
8 13 1 3
9 15 1 5
10 51 5 1
In [125]:
pd.crosstab(azdias_cleaned_encoded["CAMEO_INTL_2015"], azdias_cleaned_encoded["CAMEO_INTL_2015_WEALTH"])
Out[125]:
CAMEO_INTL_2015_WEALTH 1 2 3 4 5
CAMEO_INTL_2015
12 13249 0 0 0 0
13 26335 0 0 0 0
14 62884 0 0 0 0
15 16974 0 0 0 0
22 0 33154 0 0 0
23 0 26750 0 0 0
24 0 91158 0 0 0
25 0 39627 0 0 0
31 0 0 19024 0 0
32 0 0 10354 0 0
33 0 0 9935 0 0
34 0 0 18524 0 0
35 0 0 10356 0 0
41 0 0 0 92336 0
43 0 0 0 56672 0
44 0 0 0 14820 0
45 0 0 0 26132 0
51 0 0 0 0 133694
52 0 0 0 0 20542
54 0 0 0 0 45391
55 0 0 0 0 23955
In [126]:
pd.crosstab(azdias_cleaned_encoded["CAMEO_INTL_2015"], azdias_cleaned_encoded["CAMEO_INTL_2015_LIFE_STAGE_TYP"])
Out[126]:
CAMEO_INTL_2015_LIFE_STAGE_TYP 1 2 3 4 5
CAMEO_INTL_2015
12 0 13249 0 0 0
13 0 0 26335 0 0
14 0 0 0 62884 0
15 0 0 0 0 16974
22 0 33154 0 0 0
23 0 0 26750 0 0
24 0 0 0 91158 0
25 0 0 0 0 39627
31 19024 0 0 0 0
32 0 10354 0 0 0
33 0 0 9935 0 0
34 0 0 0 18524 0
35 0 0 0 0 10356
41 92336 0 0 0 0
43 0 0 56672 0 0
44 0 0 0 14820 0
45 0 0 0 0 26132
51 133694 0 0 0 0
52 0 20542 0 0 0
54 0 0 0 45391 0
55 0 0 0 0 23955
In [127]:
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop('CAMEO_INTL_2015', axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 72)
(798067, 71)
In [128]:
mixed_vars = feat_info[feat_info['type']=='mixed']
len(mixed_vars)
list(mixed_vars['attribute'])

# PRAEGENDE_JUGENDJAHRE and CAMEO_INTL_2015 has already been taken care of

# ['LP_LEBENSPHASE_FEIN',
#  'LP_LEBENSPHASE_GROB',
#  'PRAEGENDE_JUGENDJAHRE', Converted by PRAEGENDE_JUGENDJAHRE_DECADE and PRAEGENDE_JUGENDJAHRE_MOVEMENT
#  'WOHNLAGE',
#  'CAMEO_INTL_2015', Converted to CAMEO_INTL_2015_WEALTH and CAMEO_INTL_2015_LIFE_STAGE_TYP
#  'KBA05_BAUMAX',
#  'PLZ8_BAUMAX']
Out[128]:
['LP_LEBENSPHASE_FEIN',
 'LP_LEBENSPHASE_GROB',
 'PRAEGENDE_JUGENDJAHRE',
 'WOHNLAGE',
 'CAMEO_INTL_2015',
 'KBA05_BAUMAX',
 'PLZ8_BAUMAX']
In [129]:
# 8.6. PLZ8_BAUMAX
# Most common building type within the PLZ8 region
# 
# -1: unknown
# 0: unknown
# 1: mainly 1-2 family homes
# 2: mainly 3-5 family homes
# 3: mainly 6-10 family homes
# 4: mainly 10+ family homes
# 5: mainly business buildings


azdias_cleaned_encoded["PLZ8_BAUMAX"].value_counts()
Out[129]:
1.0    499550
5.0     97333
2.0     70407
4.0     56684
3.0     50732
Name: PLZ8_BAUMAX, dtype: int64
In [130]:
azdias_cleaned_encoded["PLZ8_BAUMAX"].isna().sum()
Out[130]:
23361
In [131]:
my_dict ={
"FAMILY": [1,2,3,4],
"BUSINESS": [5]
}

azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'] = ""

recode_column_according_2_my_dict(azdias_cleaned_encoded, "PLZ8_BAUMAX", "PLZ8_BAUMAX_BLDNG_TYPE_TEMP", my_dict)
Out[131]:
PLZ8_BAUMAX_BLDNG_TYPE_TEMP BUSINESS FAMILY
PLZ8_BAUMAX
1.0 0 499550
2.0 0 70407
3.0 0 50732
4.0 0 56684
5.0 97333 0
In [132]:
azdias_cleaned_encoded.head()
Out[132]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR WOHNLAGE KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_TEMP
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 21.0 6.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 4.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0 0 90 0 1 5 1 FAMILY
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 1.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 2.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 0 0 90 1 0 2 4 FAMILY
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 NaN NaN 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 7.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 0 0 70 0 1 1 2 FAMILY
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 32.0 10.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 3.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 0 0 70 0 1 4 3 FAMILY
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 8.0 2.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 7.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0 0 50 0 1 5 4 FAMILY
In [133]:
azdias_cleaned_encoded = pd.concat([azdias_cleaned_encoded, pd.get_dummies(azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'], prefix='PLZ8_BAUMAX_BLDNG_TYPE')], axis=1)
In [134]:
pd.crosstab(azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'], azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS'])
Out[134]:
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS 0 1
PLZ8_BAUMAX_BLDNG_TYPE_TEMP
23361 0
BUSINESS 0 97333
FAMILY 677373 0
In [135]:
pd.crosstab(azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'], azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_FAMILY'])
Out[135]:
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY 0 1
PLZ8_BAUMAX_BLDNG_TYPE_TEMP
23361 0
BUSINESS 97333 0
FAMILY 0 677373
In [136]:
pd.crosstab(azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS'], azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_FAMILY'])
Out[136]:
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY 0 1
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS
0 23361 677373
1 97333 0
In [137]:
azdias_cleaned_encoded.head()
Out[137]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR WOHNLAGE KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_TEMP PLZ8_BAUMAX_BLDNG_TYPE_ PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 21.0 6.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 4.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0 0 90 0 1 5 1 FAMILY 0 0 1
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 1.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 2.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 0 0 90 1 0 2 4 FAMILY 0 0 1
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 NaN NaN 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 7.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 0 0 70 0 1 1 2 FAMILY 0 0 1
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 32.0 10.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 3.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 0 0 70 0 1 4 3 FAMILY 0 0 1
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 8.0 2.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 7.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0 0 50 0 1 5 4 FAMILY 0 0 1
In [138]:
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['PLZ8_BAUMAX_BLDNG_TYPE_TEMP','PLZ8_BAUMAX_BLDNG_TYPE_'], axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 75)
(798067, 73)
In [139]:
azdias_cleaned_encoded.head()
Out[139]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR WOHNLAGE KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 21.0 6.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 4.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0 0 90 0 1 5 1 0 1
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 1.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 2.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0 0 0 0 90 1 0 2 4 0 1
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 NaN NaN 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 7.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0 7 0 0 70 0 1 1 2 0 1
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 32.0 10.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 3.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0 0 0 0 70 0 1 4 3 0 1
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 8.0 2.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 7.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0 0 50 0 1 5 4 0 1
In [140]:
pd.crosstab(azdias_cleaned_encoded.PLZ8_BAUMAX, azdias_cleaned_encoded.PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS)
Out[140]:
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS 0 1
PLZ8_BAUMAX
1.0 499550 0
2.0 70407 0
3.0 50732 0
4.0 56684 0
5.0 0 97333
In [141]:
pd.crosstab(azdias_cleaned_encoded.PLZ8_BAUMAX, azdias_cleaned_encoded.PLZ8_BAUMAX_BLDNG_TYPE_FAMILY)
Out[141]:
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY 0 1
PLZ8_BAUMAX
1.0 0 499550
2.0 0 70407
3.0 0 50732
4.0 0 56684
5.0 97333 0
In [142]:
# how many family homes are there?
condition = azdias_cleaned_encoded['PLZ8_BAUMAX'] == 5
 
azdias_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES'] = azdias_cleaned_encoded['PLZ8_BAUMAX']
 
azdias_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES']  = np.where(condition==True, 0, azdias_cleaned_encoded['PLZ8_BAUMAX'])

pd.crosstab(azdias_cleaned_encoded['PLZ8_BAUMAX'], azdias_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES'] )
Out[142]:
PLZ8_BAUMAX_FAMILY_HOMES 0.0 1.0 2.0 3.0 4.0
PLZ8_BAUMAX
1.0 0 499550 0 0 0
2.0 0 0 70407 0 0
3.0 0 0 0 50732 0
4.0 0 0 0 0 56684
5.0 97333 0 0 0 0
In [143]:
azdias_cleaned_encoded['PLZ8_BAUMAX'].value_counts()
Out[143]:
1.0    499550
5.0     97333
2.0     70407
4.0     56684
3.0     50732
Name: PLZ8_BAUMAX, dtype: int64
In [144]:
azdias_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES'].value_counts()
Out[144]:
1.0    499550
0.0     97333
2.0     70407
4.0     56684
3.0     50732
Name: PLZ8_BAUMAX_FAMILY_HOMES, dtype: int64
In [145]:
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop('PLZ8_BAUMAX', axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 74)
(798067, 73)
In [146]:
azdias_cleaned_encoded.head()
Out[146]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR WOHNLAGE KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 21.0 6.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 4.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0 0 90 0 1 5 1 0 1 1.0
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 1.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 2.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 4.0 4.0 3.0 5.0 2.0 0 0 0 90 1 0 2 4 0 1 1.0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 NaN NaN 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 7.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 3.0 4.0 2.0 3.0 3.0 7 0 0 70 0 1 1 2 0 1 1.0
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 32.0 10.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 3.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 3.0 3.0 4.0 6.0 5.0 0 0 0 70 0 1 4 3 0 1 2.0
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 8.0 2.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 7.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0 0 50 0 1 5 4 0 1 1.0
In [147]:
pd.crosstab(azdias_cleaned_encoded.PLZ8_BAUMAX_BLDNG_TYPE_FAMILY, azdias_cleaned_encoded.PLZ8_BAUMAX_FAMILY_HOMES)
Out[147]:
PLZ8_BAUMAX_FAMILY_HOMES 0.0 1.0 2.0 3.0 4.0
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY
0 97333 0 0 0 0
1 0 499550 70407 50732 56684
In [148]:
# 3.7. WOHNLAGE
# Neighborhood quality (or rural flag)
# 
# -1: unknown
# 0: no score calculated
# 1: very good neighborhood
# 2: good neighborhood
# 3: average neighborhood
# 4: poor neighborhood
# 5: very poor neighborhood
# 7: rural neighborhood
# 8: new building in rural neighborhood


# note: 0 means no score calculated. In feature_info file, this should have been included into a code for missing/uknown(?)
In [149]:
#explore
print(azdias_cleaned_encoded["WOHNLAGE"].value_counts())
print()
3.0    249719
7.0    169317
4.0    135973
2.0    100376
5.0     74346
1.0     43917
8.0     17472
0.0      6947
Name: WOHNLAGE, dtype: int64

In [150]:
#explore nan
print(azdias_cleaned_encoded["WOHNLAGE"].isna().sum())
0
In [151]:
condition = azdias_cleaned_encoded["WOHNLAGE"].isin([7,8])
 
azdias_cleaned_encoded["WOHNLAGE_RURAL_FLAG"] = np.where(condition==True, 1,0)

pd.crosstab(azdias_cleaned_encoded["WOHNLAGE"], azdias_cleaned_encoded["WOHNLAGE_RURAL_FLAG"] )
Out[151]:
WOHNLAGE_RURAL_FLAG 0 1
WOHNLAGE
0.0 6947 0
1.0 43917 0
2.0 100376 0
3.0 249719 0
4.0 135973 0
5.0 74346 0
7.0 0 169317
8.0 0 17472
In [152]:
condition = azdias_cleaned_encoded["WOHNLAGE"].isin([7,8])
 
azdias_cleaned_encoded["WOHNLAGE_CITY_NEIGHBOURHOOD"] = np.where(condition==True, 0,azdias_cleaned_encoded["WOHNLAGE"])

pd.crosstab(azdias_cleaned_encoded["WOHNLAGE"], azdias_cleaned_encoded["WOHNLAGE_CITY_NEIGHBOURHOOD"])
Out[152]:
WOHNLAGE_CITY_NEIGHBOURHOOD 0.0 1.0 2.0 3.0 4.0 5.0
WOHNLAGE
0.0 6947 0 0 0 0 0
1.0 0 43917 0 0 0 0
2.0 0 0 100376 0 0 0
3.0 0 0 0 249719 0 0
4.0 0 0 0 0 135973 0
5.0 0 0 0 0 0 74346
7.0 169317 0 0 0 0 0
8.0 17472 0 0 0 0 0
In [153]:
azdias_cleaned_encoded["WOHNLAGE"].value_counts()
Out[153]:
3.0    249719
7.0    169317
4.0    135973
2.0    100376
5.0     74346
1.0     43917
8.0     17472
0.0      6947
Name: WOHNLAGE, dtype: int64
In [154]:
azdias_cleaned_encoded["WOHNLAGE_CITY_NEIGHBOURHOOD"].value_counts()
Out[154]:
3.0    249719
0.0    193736
4.0    135973
2.0    100376
5.0     74346
1.0     43917
Name: WOHNLAGE_CITY_NEIGHBOURHOOD, dtype: int64
In [155]:
# drop the original column
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['WOHNLAGE'], axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 75)
(798067, 74)
In [156]:
azdias_cleaned_encoded.head()
Out[156]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 21.0 6.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0 0 90 0 1 5 1 0 1 1.0 0 4.0
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 1.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 4.0 4.0 3.0 5.0 2.0 0 0 0 90 1 0 2 4 0 1 1.0 0 2.0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 NaN NaN 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 3.0 4.0 2.0 3.0 3.0 7 0 0 70 0 1 1 2 0 1 1.0 1 0.0
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 32.0 10.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 3.0 3.0 4.0 6.0 5.0 0 0 0 70 0 1 4 3 0 1 2.0 0 3.0
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 8.0 2.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0 0 50 0 1 5 4 0 1 1.0 1 0.0
In [157]:
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['LP_LEBENSPHASE_FEIN','LP_LEBENSPHASE_GROB'], axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 74)
(798067, 72)
In [158]:
azdias_cleaned_encoded.head()
Out[158]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB NAN_count many_missing_values_in_row OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 5.0 4.0 3.0 5.0 4.0 0 0 0 90 0 1 5 1 0 1 1.0 0 4.0
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 4.0 4.0 3.0 5.0 2.0 0 0 0 90 1 0 2 4 0 1 1.0 0 2.0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 3.0 4.0 2.0 3.0 3.0 7 0 0 70 0 1 1 2 0 1 1.0 1 0.0
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 3.0 3.0 4.0 6.0 5.0 0 0 0 70 0 1 4 3 0 1 2.0 0 3.0
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 0 0 50 0 1 5 4 0 1 1.0 1 0.0

Discussion 1.2.2: Engineer Mixed-Type Features

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding mixed-value features. Which ones did you keep, which did you drop, and what engineering steps did you perform?)

LP_LEBENSPHASE_FEIN. There are too many dimensions (age, income, home owenrship, family status, parenthood) and levels here. Some of these dimensions are present in other variables just to keep things simple, I dropped it.

LP_LEBENSPHASE_GROB. Family component is already present in other variables, such as LP_FAMILIE_FEIN. Income has only two levels: 1) low and avg & 2) high. There are other variables which indicate income, such as LP_STATUS_FEIN, SP_STATUS_GROS, HH_EINKOMMEN_SCORE. Dropped this column.

PRAEGENDE_JUGENDJAHRE. Converted to PRAEGENDE_JUGENDJAHRE_DECADE and PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE (dummy variable) and PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM (dummy variable).

WOHNLAGE. Converted to WOHNLAGE_RURAL_FLAG (dummy variable) and WOHNLAGE_CITY_NEIGHBOURHOOD

CAMEO_INTL_2015. Converted to CAMEO_INTL_2015_WEALTH and CAMEO_INTL_2015_LIFE_STAGE_TYP.

KBA05_BAUMAX This one was removed earlier as an outlier column with too much missing data.

PLZ8_BAUMAX Converted to PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS (dummy variable), PLZ8_BAUMAX_BLDNG_TYPE_FAMILY (dummy variable), PLZ8_BAUMAX_FAMILY_HOMES.

Step 1.2.3: Complete Feature Selection

In order to finish this step up, you need to make sure that your data frame now only has the columns that you want to keep. To summarize, the dataframe should consist of the following:

  • All numeric, interval, and ordinal type columns from the original dataset.
  • Binary categorical features (all numerically-encoded).
  • Engineered features from other multi-level categorical features and mixed features.

Make sure that for any new columns that you have engineered, that you've excluded the original columns from the final dataset. Otherwise, their values will interfere with the analysis later on the project. For example, you should not keep "PRAEGENDE_JUGENDJAHRE", since its values won't be useful for the algorithm: only the values derived from it in the engineered features you created should be retained. As a reminder, your data should only be from the subset with few or no missing values.

If there are other re-engineering tasks you need to perform, make sure you take care of them here. (Dealing with missing data will come in step 2.1.)

Do whatever you need to in order to ensure that the dataframe only contains the columns that should be passed to the algorithm functions.

In [161]:
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['NAN_count','many_missing_values_in_row'], axis = 1)
print(azdias_cleaned_encoded.shape)
(798067, 72)
(798067, 70)
In [162]:
azdias_cleaned_encoded.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 798067 entries, 1 to 891220
Data columns (total 70 columns):
ALTERSKATEGORIE_GROB                         795264 non-null float64
ANREDE_KZ                                    798067 non-null float64
FINANZ_MINIMALIST                            798067 non-null float64
FINANZ_SPARER                                798067 non-null float64
FINANZ_VORSORGER                             798067 non-null float64
FINANZ_ANLEGER                               798067 non-null float64
FINANZ_UNAUFFAELLIGER                        798067 non-null float64
FINANZ_HAUSBAUER                             798067 non-null float64
GREEN_AVANTGARDE                             798067 non-null float64
HEALTH_TYP                                   761341 non-null float64
RETOURTYP_BK_S                               793318 non-null float64
SEMIO_SOZ                                    798067 non-null float64
SEMIO_FAM                                    798067 non-null float64
SEMIO_REL                                    798067 non-null float64
SEMIO_MAT                                    798067 non-null float64
SEMIO_VERT                                   798067 non-null float64
SEMIO_LUST                                   798067 non-null float64
SEMIO_ERL                                    798067 non-null float64
SEMIO_KULT                                   798067 non-null float64
SEMIO_RAT                                    798067 non-null float64
SEMIO_KRIT                                   798067 non-null float64
SEMIO_DOM                                    798067 non-null float64
SEMIO_KAEM                                   798067 non-null float64
SEMIO_PFLICHT                                798067 non-null float64
SEMIO_TRADV                                  798067 non-null float64
SOHO_KZ                                      798067 non-null float64
VERS_TYP                                     761341 non-null float64
ANZ_PERSONEN                                 798067 non-null float64
ANZ_TITEL                                    798067 non-null float64
HH_EINKOMMEN_SCORE                           798067 non-null float64
W_KEIT_KIND_HH                               738762 non-null float64
WOHNDAUER_2008                               798067 non-null float64
ANZ_HAUSHALTE_AKTIV                          791605 non-null float64
ANZ_HH_TITEL                                 794208 non-null float64
KONSUMNAEHE                                  797995 non-null float64
MIN_GEBAEUDEJAHR                             798067 non-null float64
KBA05_ANTG1                                  757897 non-null float64
KBA05_ANTG2                                  757897 non-null float64
KBA05_ANTG3                                  757897 non-null float64
KBA05_ANTG4                                  757897 non-null float64
KBA05_GBZ                                    757897 non-null float64
BALLRAUM                                     797475 non-null float64
EWDICHTE                                     797475 non-null float64
INNENSTADT                                   797475 non-null float64
GEBAEUDETYP_RASTER                           798060 non-null float64
KKK                                          733157 non-null float64
MOBI_REGIO                                   757897 non-null float64
ONLINE_AFFINITAET                            793318 non-null float64
REGIOTYP                                     733157 non-null float64
KBA13_ANZAHL_PKW                             785420 non-null float64
PLZ8_ANTG1                                   774706 non-null float64
PLZ8_ANTG2                                   774706 non-null float64
PLZ8_ANTG3                                   774706 non-null float64
PLZ8_ANTG4                                   774706 non-null float64
PLZ8_HHZ                                     774706 non-null float64
PLZ8_GBZ                                     774706 non-null float64
ARBEIT                                       793840 non-null float64
ORTSGR_KLS9                                  793941 non-null float64
RELAT_AB                                     793840 non-null float64
OST_WEST_KZ_O                                798067 non-null uint8
PRAEGENDE_JUGENDJAHRE_DECADE                 798067 non-null int64
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE    798067 non-null uint8
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM    798067 non-null uint8
CAMEO_INTL_2015_WEALTH                       791866 non-null object
CAMEO_INTL_2015_LIFE_STAGE_TYP               791866 non-null object
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS              798067 non-null uint8
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY                798067 non-null uint8
PLZ8_BAUMAX_FAMILY_HOMES                     774706 non-null float64
WOHNLAGE_RURAL_FLAG                          798067 non-null int64
WOHNLAGE_CITY_NEIGHBOURHOOD                  798067 non-null float64
dtypes: float64(61), int64(2), object(2), uint8(5)
memory usage: 405.7+ MB
In [163]:
for item in list(mixed_vars['attribute']):
    print(item)
    try: 
        print(azdias_cleaned_encoded[item])
    except:
        print('this var has been deleted and/or recoded')
    print()    
LP_LEBENSPHASE_FEIN
this var has been deleted and/or recoded

LP_LEBENSPHASE_GROB
this var has been deleted and/or recoded

PRAEGENDE_JUGENDJAHRE
this var has been deleted and/or recoded

WOHNLAGE
this var has been deleted and/or recoded

CAMEO_INTL_2015
this var has been deleted and/or recoded

KBA05_BAUMAX
this var has been deleted and/or recoded

PLZ8_BAUMAX
this var has been deleted and/or recoded

In [164]:
for item in multilevel:
    print(item)
    try: 
        print(azdias_cleaned_encoded[item])
    except:
        print('this var has been deleted and/or recoded')
    print()    
CJT_GESAMTTYP
this var has been deleted and/or recoded

FINANZTYP
this var has been deleted and/or recoded

GFK_URLAUBERTYP
this var has been deleted and/or recoded

LP_FAMILIE_FEIN
this var has been deleted and/or recoded

LP_FAMILIE_GROB
this var has been deleted and/or recoded

LP_STATUS_FEIN
this var has been deleted and/or recoded

LP_STATUS_GROB
this var has been deleted and/or recoded

NATIONALITAET_KZ
this var has been deleted and/or recoded

SHOPPER_TYP
this var has been deleted and/or recoded

ZABEOTYP
this var has been deleted and/or recoded

GEBAEUDETYP
this var has been deleted and/or recoded

CAMEO_DEUG_2015
this var has been deleted and/or recoded

CAMEO_DEU_2015
this var has been deleted and/or recoded

In [165]:
for item in ['OST_WEST_KZ']:
    print(item)
    try: 
        print(azdias_cleaned_encoded[item])
    except:
        print('this var has been deleted and/or recoded')
    print()     
OST_WEST_KZ
this var has been deleted and/or recoded

In [166]:
for item in ['PRAEGENDE_JUGENDJAHRE', 'CAMEO_INTL_2015']:
    print(item)
    try: 
        print(azdias_cleaned_encoded[item])
    except:
        print('this var has been deleted and/or recoded')
    print()    
PRAEGENDE_JUGENDJAHRE
this var has been deleted and/or recoded

CAMEO_INTL_2015
this var has been deleted and/or recoded

Step 1.3: Create a Cleaning Function

Even though you've finished cleaning up the general population demographics data, it's important to look ahead to the future and realize that you'll need to perform the same cleaning steps on the customer demographics data. In this substep, complete the function below to execute the main feature selection, encoding, and re-engineering steps you performed above. Then, when it comes to looking at the customer data in Step 3, you can just run this function on that DataFrame to get the trimmed dataset in a single step.

In [167]:
def clean_data(df):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data

    INPUT: Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame

    NOTE: This function does NOT build a cleaning procedure from scratch.
    It reuses the data transformation functions from previous sections. 
    This will help avoid code repetition and make this function more readable and easy to maintain.
    """

    ###########################################################################
    # Put in code here to execute all main cleaning steps:
    # convert missing value codes into NaNs, ...
    
#    df = azdias.copy() # test - remove later. Keep commented out in PROD
    
    
    df_cleaned = replace_coded_as_missing_unknown_with_NANs(df) 
    
    # summarize NAN for columns
    df_cleaned_NA_report = how_many_NA(df_cleaned)

    # remove selected columns and rows, ...

    # Investigate patterns in the amount of missing data in each column.
    #  the following columns have more than 1/3 of data missing
    outliers_df = df_cleaned_NA_report[df_cleaned_NA_report['missing_NA_percent']>0.333]

    # delete outlier columns with too much missing data
    for outlier_column in outliers_df['Column'].tolist():
        del(df_cleaned[outlier_column])

    ###########################################################################
    # add a column to the df_cleaned with NAN count for each row - takes 3-5 mins to run
    count_NAN_in_each_ROW(df_cleaned)

    # divide into subsets
    df_cleaned_missing_few, df_cleaned_missing_many, threshold = divide_into_subsets(df_cleaned, 30)

    # select, re-encode, and engineer column values.
    #########################################################
    # Make dummy variables for OST_WEST_KZ

    df_cleaned_encoded = pd.concat([df_cleaned_missing_few, pd.get_dummies(df_cleaned_missing_few['OST_WEST_KZ'], prefix='OST_WEST_KZ')], axis=1)



    #########################################################
    # recode PRAEGENDE_JUGENDJAHRE
    # initialize column
    df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_DECADE'] = 0

    my_dict = {
    40:[1,2],
    50:[3,4],
    60:[5,6,7],
    70:[8,9],
    80:[10,11,12,13],
    90:[14,15]
    }
    
    print(my_dict)
    
    recode_column_according_2_my_dict(df_cleaned_encoded, "PRAEGENDE_JUGENDJAHRE", "PRAEGENDE_JUGENDJAHRE_DECADE",my_dict)
    
    print(pd.crosstab(df_cleaned_encoded["PRAEGENDE_JUGENDJAHRE"], df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_DECADE']))


    df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'] = ""

    my_dict ={
        "AVANTGARDE": [2, 4, 6, 7, 9, 11, 13, 15],
        "MAINSTREAM": [1, 3, 5, 8, 10, 12, 14]
    }

    recode_column_according_2_my_dict(df_cleaned_encoded, "PRAEGENDE_JUGENDJAHRE", "PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP",my_dict)

    # Make dummy variables for PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP
    df_cleaned_encoded = pd.concat([df_cleaned_encoded, pd.get_dummies(df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], prefix='PRAEGENDE_JUGENDJAHRE_MOVEMENT')], axis=1)


    print(pd.crosstab(df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE']))

    print(pd.crosstab(df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM']))


    # PRAEGENDE_JUGENDJAHRE_MOVEMENT_ is caused by missing values, we don't need this col, it just tell u number of NAN in the original column PRAEGENDE_JUGENDJAHRE
    # PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP is a temporary column, we don't need it either


    #########################################################
    df_cleaned_encoded["CAMEO_INTL_2015_WEALTH"] =df_cleaned_encoded["CAMEO_INTL_2015"].str.slice(0,1)

    df_cleaned_encoded["CAMEO_INTL_2015_LIFE_STAGE_TYP"] =df_cleaned_encoded["CAMEO_INTL_2015"].str.slice(1,2)



    #########################################################

    df_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'] = ""

    my_dict ={
    "FAMILY": [1,2,3,4],
    "BUSINESS": [5]
    }

    recode_column_according_2_my_dict(df_cleaned_encoded, "PLZ8_BAUMAX", "PLZ8_BAUMAX_BLDNG_TYPE_TEMP", my_dict)

    df_cleaned_encoded = pd.concat([df_cleaned_encoded, pd.get_dummies(df_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'], prefix='PLZ8_BAUMAX_BLDNG_TYPE')], axis=1)



    # how many family homes are there?
    condition = df_cleaned_encoded['PLZ8_BAUMAX'] == 5

    # df_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES'] = df_cleaned_encoded['PLZ8_BAUMAX']

    df_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES']  = np.where(condition==True, 0, df_cleaned_encoded['PLZ8_BAUMAX'])


    #########################################################

    condition = df_cleaned_encoded["WOHNLAGE"].isin([7,8])

    df_cleaned_encoded["WOHNLAGE_RURAL_FLAG"] = np.where(condition==True, 1,0)

    # condition = df_cleaned_encoded["WOHNLAGE"].isin([7,8])

    df_cleaned_encoded["WOHNLAGE_CITY_NEIGHBOURHOOD"] = np.where(condition==True, 0,df_cleaned_encoded["WOHNLAGE"])

    #########################################################
    # the variable multilevel comes from code above which analyzes cat variables
    df_cleaned_encoded = df_cleaned_encoded.drop(multilevel, axis = 1)

    drop_list = ['OST_WEST_KZ','OST_WEST_KZ_W','PRAEGENDE_JUGENDJAHRE','PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP','PRAEGENDE_JUGENDJAHRE_MOVEMENT_','CAMEO_INTL_2015','PLZ8_BAUMAX_BLDNG_TYPE_TEMP','PLZ8_BAUMAX_BLDNG_TYPE_','PLZ8_BAUMAX','WOHNLAGE','LP_LEBENSPHASE_FEIN','LP_LEBENSPHASE_GROB','NAN_count','many_missing_values_in_row']

    df_cleaned_encoded = df_cleaned_encoded.drop(drop_list, axis = 1)


    # Return the cleaned dataframe.
    return df_cleaned_encoded
In [168]:
azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION = clean_data(azdias)
Threshold of missing rows in the row:  30

Total nrow: 891221

Few Missing Values. Will be kept.
798067

Lots of Missing Values. Will be deleted
93154

% of rows with lots of missing values. % of data deleted 
0.104524018173
{40: [1, 2], 50: [3, 4], 60: [5, 6, 7], 70: [8, 9], 80: [10, 11, 12, 13], 90: [14, 15]}
PRAEGENDE_JUGENDJAHRE_DECADE     40     50     60      70     80      90
PRAEGENDE_JUGENDJAHRE                                                   
1.0                           20678      0      0       0      0       0
2.0                            7479      0      0       0      0       0
3.0                               0  53845      0       0      0       0
4.0                               0  20451      0       0      0       0
5.0                               0      0  84692       0      0       0
6.0                               0      0  25652       0      0       0
7.0                               0      0   4010       0      0       0
8.0                               0      0      0  141630      0       0
9.0                               0      0      0   33570      0       0
10.0                              0      0      0       0  85808       0
11.0                              0      0      0       0  35752       0
12.0                              0      0      0       0  24446       0
13.0                              0      0      0       0   5764       0
14.0                              0      0      0       0      0  182985
15.0                              0      0      0       0      0   42547
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE       0       1
PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP                      
                                            28758       0
AVANTGARDE                                      0  175225
MAINSTREAM                                 594084       0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM       0       1
PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP                      
                                            28758       0
AVANTGARDE                                 175225       0
MAINSTREAM                                      0  594084
In [169]:
# I want to make my cleaning function returns the same resutls as my cleaning steps abovel 
# If it says True below, then we're good
azdias_cleaned_encoded.equals(azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION)
Out[169]:
True

Explore the dataframe produced by the cleaning function

In [170]:
azdias_cleaned_encoded.head()
Out[170]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 5.0 4.0 3.0 5.0 4.0 0 90 0 1 5 1 0 1 1.0 0 4.0
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 4.0 4.0 3.0 5.0 2.0 0 90 1 0 2 4 0 1 1.0 0 2.0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 3.0 4.0 2.0 3.0 3.0 0 70 0 1 1 2 0 1 1.0 1 0.0
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 3.0 3.0 4.0 6.0 5.0 0 70 0 1 4 3 0 1 2.0 0 3.0
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 50 0 1 5 4 0 1 1.0 1 0.0
In [171]:
azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION.head()
Out[171]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 5.0 4.0 3.0 5.0 4.0 0 90 0 1 5 1 0 1 1.0 0 4.0
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 4.0 4.0 3.0 5.0 2.0 0 90 1 0 2 4 0 1 1.0 0 2.0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 3.0 4.0 2.0 3.0 3.0 0 70 0 1 1 2 0 1 1.0 1 0.0
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 3.0 3.0 4.0 6.0 5.0 0 70 0 1 4 3 0 1 2.0 0 3.0
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 50 0 1 5 4 0 1 1.0 1 0.0
In [172]:
azdias_cleaned_encoded.shape
Out[172]:
(798067, 70)
In [173]:
azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION.shape
Out[173]:
(798067, 70)
In [174]:
azdias_cleaned_encoded.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 798067 entries, 1 to 891220
Data columns (total 70 columns):
ALTERSKATEGORIE_GROB                         795264 non-null float64
ANREDE_KZ                                    798067 non-null float64
FINANZ_MINIMALIST                            798067 non-null float64
FINANZ_SPARER                                798067 non-null float64
FINANZ_VORSORGER                             798067 non-null float64
FINANZ_ANLEGER                               798067 non-null float64
FINANZ_UNAUFFAELLIGER                        798067 non-null float64
FINANZ_HAUSBAUER                             798067 non-null float64
GREEN_AVANTGARDE                             798067 non-null float64
HEALTH_TYP                                   761341 non-null float64
RETOURTYP_BK_S                               793318 non-null float64
SEMIO_SOZ                                    798067 non-null float64
SEMIO_FAM                                    798067 non-null float64
SEMIO_REL                                    798067 non-null float64
SEMIO_MAT                                    798067 non-null float64
SEMIO_VERT                                   798067 non-null float64
SEMIO_LUST                                   798067 non-null float64
SEMIO_ERL                                    798067 non-null float64
SEMIO_KULT                                   798067 non-null float64
SEMIO_RAT                                    798067 non-null float64
SEMIO_KRIT                                   798067 non-null float64
SEMIO_DOM                                    798067 non-null float64
SEMIO_KAEM                                   798067 non-null float64
SEMIO_PFLICHT                                798067 non-null float64
SEMIO_TRADV                                  798067 non-null float64
SOHO_KZ                                      798067 non-null float64
VERS_TYP                                     761341 non-null float64
ANZ_PERSONEN                                 798067 non-null float64
ANZ_TITEL                                    798067 non-null float64
HH_EINKOMMEN_SCORE                           798067 non-null float64
W_KEIT_KIND_HH                               738762 non-null float64
WOHNDAUER_2008                               798067 non-null float64
ANZ_HAUSHALTE_AKTIV                          791605 non-null float64
ANZ_HH_TITEL                                 794208 non-null float64
KONSUMNAEHE                                  797995 non-null float64
MIN_GEBAEUDEJAHR                             798067 non-null float64
KBA05_ANTG1                                  757897 non-null float64
KBA05_ANTG2                                  757897 non-null float64
KBA05_ANTG3                                  757897 non-null float64
KBA05_ANTG4                                  757897 non-null float64
KBA05_GBZ                                    757897 non-null float64
BALLRAUM                                     797475 non-null float64
EWDICHTE                                     797475 non-null float64
INNENSTADT                                   797475 non-null float64
GEBAEUDETYP_RASTER                           798060 non-null float64
KKK                                          733157 non-null float64
MOBI_REGIO                                   757897 non-null float64
ONLINE_AFFINITAET                            793318 non-null float64
REGIOTYP                                     733157 non-null float64
KBA13_ANZAHL_PKW                             785420 non-null float64
PLZ8_ANTG1                                   774706 non-null float64
PLZ8_ANTG2                                   774706 non-null float64
PLZ8_ANTG3                                   774706 non-null float64
PLZ8_ANTG4                                   774706 non-null float64
PLZ8_HHZ                                     774706 non-null float64
PLZ8_GBZ                                     774706 non-null float64
ARBEIT                                       793840 non-null float64
ORTSGR_KLS9                                  793941 non-null float64
RELAT_AB                                     793840 non-null float64
OST_WEST_KZ_O                                798067 non-null uint8
PRAEGENDE_JUGENDJAHRE_DECADE                 798067 non-null int64
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE    798067 non-null uint8
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM    798067 non-null uint8
CAMEO_INTL_2015_WEALTH                       791866 non-null object
CAMEO_INTL_2015_LIFE_STAGE_TYP               791866 non-null object
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS              798067 non-null uint8
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY                798067 non-null uint8
PLZ8_BAUMAX_FAMILY_HOMES                     774706 non-null float64
WOHNLAGE_RURAL_FLAG                          798067 non-null int64
WOHNLAGE_CITY_NEIGHBOURHOOD                  798067 non-null float64
dtypes: float64(61), int64(2), object(2), uint8(5)
memory usage: 405.7+ MB
In [175]:
azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 798067 entries, 1 to 891220
Data columns (total 70 columns):
ALTERSKATEGORIE_GROB                         795264 non-null float64
ANREDE_KZ                                    798067 non-null float64
FINANZ_MINIMALIST                            798067 non-null float64
FINANZ_SPARER                                798067 non-null float64
FINANZ_VORSORGER                             798067 non-null float64
FINANZ_ANLEGER                               798067 non-null float64
FINANZ_UNAUFFAELLIGER                        798067 non-null float64
FINANZ_HAUSBAUER                             798067 non-null float64
GREEN_AVANTGARDE                             798067 non-null float64
HEALTH_TYP                                   761341 non-null float64
RETOURTYP_BK_S                               793318 non-null float64
SEMIO_SOZ                                    798067 non-null float64
SEMIO_FAM                                    798067 non-null float64
SEMIO_REL                                    798067 non-null float64
SEMIO_MAT                                    798067 non-null float64
SEMIO_VERT                                   798067 non-null float64
SEMIO_LUST                                   798067 non-null float64
SEMIO_ERL                                    798067 non-null float64
SEMIO_KULT                                   798067 non-null float64
SEMIO_RAT                                    798067 non-null float64
SEMIO_KRIT                                   798067 non-null float64
SEMIO_DOM                                    798067 non-null float64
SEMIO_KAEM                                   798067 non-null float64
SEMIO_PFLICHT                                798067 non-null float64
SEMIO_TRADV                                  798067 non-null float64
SOHO_KZ                                      798067 non-null float64
VERS_TYP                                     761341 non-null float64
ANZ_PERSONEN                                 798067 non-null float64
ANZ_TITEL                                    798067 non-null float64
HH_EINKOMMEN_SCORE                           798067 non-null float64
W_KEIT_KIND_HH                               738762 non-null float64
WOHNDAUER_2008                               798067 non-null float64
ANZ_HAUSHALTE_AKTIV                          791605 non-null float64
ANZ_HH_TITEL                                 794208 non-null float64
KONSUMNAEHE                                  797995 non-null float64
MIN_GEBAEUDEJAHR                             798067 non-null float64
KBA05_ANTG1                                  757897 non-null float64
KBA05_ANTG2                                  757897 non-null float64
KBA05_ANTG3                                  757897 non-null float64
KBA05_ANTG4                                  757897 non-null float64
KBA05_GBZ                                    757897 non-null float64
BALLRAUM                                     797475 non-null float64
EWDICHTE                                     797475 non-null float64
INNENSTADT                                   797475 non-null float64
GEBAEUDETYP_RASTER                           798060 non-null float64
KKK                                          733157 non-null float64
MOBI_REGIO                                   757897 non-null float64
ONLINE_AFFINITAET                            793318 non-null float64
REGIOTYP                                     733157 non-null float64
KBA13_ANZAHL_PKW                             785420 non-null float64
PLZ8_ANTG1                                   774706 non-null float64
PLZ8_ANTG2                                   774706 non-null float64
PLZ8_ANTG3                                   774706 non-null float64
PLZ8_ANTG4                                   774706 non-null float64
PLZ8_HHZ                                     774706 non-null float64
PLZ8_GBZ                                     774706 non-null float64
ARBEIT                                       793840 non-null float64
ORTSGR_KLS9                                  793941 non-null float64
RELAT_AB                                     793840 non-null float64
OST_WEST_KZ_O                                798067 non-null uint8
PRAEGENDE_JUGENDJAHRE_DECADE                 798067 non-null int64
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE    798067 non-null uint8
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM    798067 non-null uint8
CAMEO_INTL_2015_WEALTH                       791866 non-null object
CAMEO_INTL_2015_LIFE_STAGE_TYP               791866 non-null object
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS              798067 non-null uint8
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY                798067 non-null uint8
PLZ8_BAUMAX_FAMILY_HOMES                     774706 non-null float64
WOHNLAGE_RURAL_FLAG                          798067 non-null int64
WOHNLAGE_CITY_NEIGHBOURHOOD                  798067 non-null float64
dtypes: float64(61), int64(2), object(2), uint8(5)
memory usage: 405.7+ MB

Step 2: Feature Transformation

Step 2.1: Apply Feature Scaling

Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features. Starting from this part of the project, you'll want to keep an eye on the API reference page for sklearn to help you navigate to all of the classes and functions that you'll need. In this substep, you'll need to check the following:

  • sklearn requires that data not have missing values in order for its estimators to work properly. So, before applying the scaler to your data, make sure that you've cleaned the DataFrame of the remaining missing values. This can be as simple as just removing all data points with missing data, or applying an Imputer to replace all missing values. You might also try a more complicated procedure where you temporarily remove missing values in order to compute the scaling parameters before re-introducing those missing values and applying imputation. Think about how much missing data you have and what possible effects each approach might have on your analysis, and justify your decision in the discussion section below.
  • For the actual scaling function, a StandardScaler instance is suggested, scaling each feature to mean 0 and standard deviation 1.
  • For these classes, you can make use of the .fit_transform() method to both fit a procedure to the data as well as apply the transformation to the data at the same time. Don't forget to keep the fit sklearn objects handy, since you'll be applying them to the customer demographics data towards the end of the project.

If you've not yet cleaned the dataset of all NaN values, then investigate and do that now.

In [176]:
# how many columns have missing values
azdias_cleaned_encoded.isnull().any().sum()  
Out[176]:
34
In [177]:
azdias_cleaned_encoded.isnull().sum()   
Out[177]:
ALTERSKATEGORIE_GROB                          2803
ANREDE_KZ                                        0
FINANZ_MINIMALIST                                0
FINANZ_SPARER                                    0
FINANZ_VORSORGER                                 0
FINANZ_ANLEGER                                   0
FINANZ_UNAUFFAELLIGER                            0
FINANZ_HAUSBAUER                                 0
GREEN_AVANTGARDE                                 0
HEALTH_TYP                                   36726
RETOURTYP_BK_S                                4749
SEMIO_SOZ                                        0
SEMIO_FAM                                        0
SEMIO_REL                                        0
SEMIO_MAT                                        0
SEMIO_VERT                                       0
SEMIO_LUST                                       0
SEMIO_ERL                                        0
SEMIO_KULT                                       0
SEMIO_RAT                                        0
SEMIO_KRIT                                       0
SEMIO_DOM                                        0
SEMIO_KAEM                                       0
SEMIO_PFLICHT                                    0
SEMIO_TRADV                                      0
SOHO_KZ                                          0
VERS_TYP                                     36726
ANZ_PERSONEN                                     0
ANZ_TITEL                                        0
HH_EINKOMMEN_SCORE                               0
W_KEIT_KIND_HH                               59305
WOHNDAUER_2008                                   0
ANZ_HAUSHALTE_AKTIV                           6462
ANZ_HH_TITEL                                  3859
KONSUMNAEHE                                     72
MIN_GEBAEUDEJAHR                                 0
KBA05_ANTG1                                  40170
KBA05_ANTG2                                  40170
KBA05_ANTG3                                  40170
KBA05_ANTG4                                  40170
KBA05_GBZ                                    40170
BALLRAUM                                       592
EWDICHTE                                       592
INNENSTADT                                     592
GEBAEUDETYP_RASTER                               7
KKK                                          64910
MOBI_REGIO                                   40170
ONLINE_AFFINITAET                             4749
REGIOTYP                                     64910
KBA13_ANZAHL_PKW                             12647
PLZ8_ANTG1                                   23361
PLZ8_ANTG2                                   23361
PLZ8_ANTG3                                   23361
PLZ8_ANTG4                                   23361
PLZ8_HHZ                                     23361
PLZ8_GBZ                                     23361
ARBEIT                                        4227
ORTSGR_KLS9                                   4126
RELAT_AB                                      4227
OST_WEST_KZ_O                                    0
PRAEGENDE_JUGENDJAHRE_DECADE                     0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE        0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM        0
CAMEO_INTL_2015_WEALTH                        6201
CAMEO_INTL_2015_LIFE_STAGE_TYP                6201
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS                  0
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY                    0
PLZ8_BAUMAX_FAMILY_HOMES                     23361
WOHNLAGE_RURAL_FLAG                              0
WOHNLAGE_CITY_NEIGHBOURHOOD                      0
dtype: int64
In [178]:
# this is a test of using StandardScaler on a dataset with missing values, ignoring NAN
#Source: https://stackoverflow.com/questions/50897516/assigning-nan-to-1-after-performing-standardscaler
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
# The standard score of a sample x is calculated as:

# z = (x - u) / s
    
    
import numpy as np 
import pandas as pd
from sklearn.preprocessing import StandardScaler

#Create a dataframe
df = pd.DataFrame({'col1': [0, np.nan, 2, 3, np.nan, 4, 5, np.nan, 6, 7, np.nan]})
print(df)

#Get the index of null values
null_values = df['col1'].isnull()

df_copy = df.copy()
#Perform standard scalar on only non-NaN values
df_copy.loc[~null_values, ['col1']] = StandardScaler().fit_transform(df.loc[~null_values, ['col1']])
print(df_copy)

#https://medium.com/@seb231/principal-component-analysis-with-missing-data-9e28f440ce93
# https://scikit-learn.org/dev/modules/impute.html
# https://stackoverflow.com/questions/29420737/pca-with-missing-values-in-python

# my understanding that even if i were to do it, i may run into issues later on while conducing PCA
    col1
0    0.0
1    NaN
2    2.0
3    3.0
4    NaN
5    4.0
6    5.0
7    NaN
8    6.0
9    7.0
10   NaN
        col1
0  -1.728498
1        NaN
2  -0.832240
3  -0.384111
4        NaN
5   0.064018
6   0.512148
7        NaN
8   0.960277
9   1.408406
10       NaN
In [179]:
azdias_cleaned_encoded.head()
Out[179]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 5.0 4.0 3.0 5.0 4.0 0 90 0 1 5 1 0 1 1.0 0 4.0
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 4.0 4.0 3.0 5.0 2.0 0 90 1 0 2 4 0 1 1.0 0 2.0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 3.0 4.0 2.0 3.0 3.0 0 70 0 1 1 2 0 1 1.0 1 0.0
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 3.0 3.0 4.0 6.0 5.0 0 70 0 1 4 3 0 1 2.0 0 3.0
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 50 0 1 5 4 0 1 1.0 1 0.0
In [180]:
azdias_cleaned_encoded_NA_report = how_many_NA(azdias_cleaned_encoded)
azdias_cleaned_encoded_NA_report
Out[180]:
Column missing_NA missing_NA_percent
0 ALTERSKATEGORIE_GROB 2803 0.003512
1 ANREDE_KZ 0 0.000000
2 FINANZ_MINIMALIST 0 0.000000
3 FINANZ_SPARER 0 0.000000
4 FINANZ_VORSORGER 0 0.000000
5 FINANZ_ANLEGER 0 0.000000
6 FINANZ_UNAUFFAELLIGER 0 0.000000
7 FINANZ_HAUSBAUER 0 0.000000
8 GREEN_AVANTGARDE 0 0.000000
9 HEALTH_TYP 36726 0.046019
10 RETOURTYP_BK_S 4749 0.005951
11 SEMIO_SOZ 0 0.000000
12 SEMIO_FAM 0 0.000000
13 SEMIO_REL 0 0.000000
14 SEMIO_MAT 0 0.000000
15 SEMIO_VERT 0 0.000000
16 SEMIO_LUST 0 0.000000
17 SEMIO_ERL 0 0.000000
18 SEMIO_KULT 0 0.000000
19 SEMIO_RAT 0 0.000000
20 SEMIO_KRIT 0 0.000000
21 SEMIO_DOM 0 0.000000
22 SEMIO_KAEM 0 0.000000
23 SEMIO_PFLICHT 0 0.000000
24 SEMIO_TRADV 0 0.000000
25 SOHO_KZ 0 0.000000
26 VERS_TYP 36726 0.046019
27 ANZ_PERSONEN 0 0.000000
28 ANZ_TITEL 0 0.000000
29 HH_EINKOMMEN_SCORE 0 0.000000
30 W_KEIT_KIND_HH 59305 0.074311
31 WOHNDAUER_2008 0 0.000000
32 ANZ_HAUSHALTE_AKTIV 6462 0.008097
33 ANZ_HH_TITEL 3859 0.004835
34 KONSUMNAEHE 72 0.000090
35 MIN_GEBAEUDEJAHR 0 0.000000
36 KBA05_ANTG1 40170 0.050334
37 KBA05_ANTG2 40170 0.050334
38 KBA05_ANTG3 40170 0.050334
39 KBA05_ANTG4 40170 0.050334
40 KBA05_GBZ 40170 0.050334
41 BALLRAUM 592 0.000742
42 EWDICHTE 592 0.000742
43 INNENSTADT 592 0.000742
44 GEBAEUDETYP_RASTER 7 0.000009
45 KKK 64910 0.081334
46 MOBI_REGIO 40170 0.050334
47 ONLINE_AFFINITAET 4749 0.005951
48 REGIOTYP 64910 0.081334
49 KBA13_ANZAHL_PKW 12647 0.015847
50 PLZ8_ANTG1 23361 0.029272
51 PLZ8_ANTG2 23361 0.029272
52 PLZ8_ANTG3 23361 0.029272
53 PLZ8_ANTG4 23361 0.029272
54 PLZ8_HHZ 23361 0.029272
55 PLZ8_GBZ 23361 0.029272
56 ARBEIT 4227 0.005297
57 ORTSGR_KLS9 4126 0.005170
58 RELAT_AB 4227 0.005297
59 OST_WEST_KZ_O 0 0.000000
60 PRAEGENDE_JUGENDJAHRE_DECADE 0 0.000000
61 PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE 0 0.000000
62 PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM 0 0.000000
63 CAMEO_INTL_2015_WEALTH 6201 0.007770
64 CAMEO_INTL_2015_LIFE_STAGE_TYP 6201 0.007770
65 PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS 0 0.000000
66 PLZ8_BAUMAX_BLDNG_TYPE_FAMILY 0 0.000000
67 PLZ8_BAUMAX_FAMILY_HOMES 23361 0.029272
68 WOHNLAGE_RURAL_FLAG 0 0.000000
69 WOHNLAGE_CITY_NEIGHBOURHOOD 0 0.000000
In [181]:
# columns with missing data
azdias_cleaned_encoded_NA_report[azdias_cleaned_encoded_NA_report['missing_NA'] != 0].sort_values(by='missing_NA_percent',ascending=False)
Out[181]:
Column missing_NA missing_NA_percent
45 KKK 64910 0.081334
48 REGIOTYP 64910 0.081334
30 W_KEIT_KIND_HH 59305 0.074311
37 KBA05_ANTG2 40170 0.050334
40 KBA05_GBZ 40170 0.050334
39 KBA05_ANTG4 40170 0.050334
38 KBA05_ANTG3 40170 0.050334
36 KBA05_ANTG1 40170 0.050334
46 MOBI_REGIO 40170 0.050334
9 HEALTH_TYP 36726 0.046019
26 VERS_TYP 36726 0.046019
52 PLZ8_ANTG3 23361 0.029272
53 PLZ8_ANTG4 23361 0.029272
54 PLZ8_HHZ 23361 0.029272
51 PLZ8_ANTG2 23361 0.029272
50 PLZ8_ANTG1 23361 0.029272
55 PLZ8_GBZ 23361 0.029272
67 PLZ8_BAUMAX_FAMILY_HOMES 23361 0.029272
49 KBA13_ANZAHL_PKW 12647 0.015847
32 ANZ_HAUSHALTE_AKTIV 6462 0.008097
63 CAMEO_INTL_2015_WEALTH 6201 0.007770
64 CAMEO_INTL_2015_LIFE_STAGE_TYP 6201 0.007770
47 ONLINE_AFFINITAET 4749 0.005951
10 RETOURTYP_BK_S 4749 0.005951
56 ARBEIT 4227 0.005297
58 RELAT_AB 4227 0.005297
57 ORTSGR_KLS9 4126 0.005170
33 ANZ_HH_TITEL 3859 0.004835
0 ALTERSKATEGORIE_GROB 2803 0.003512
43 INNENSTADT 592 0.000742
42 EWDICHTE 592 0.000742
41 BALLRAUM 592 0.000742
34 KONSUMNAEHE 72 0.000090
44 GEBAEUDETYP_RASTER 7 0.000009
In [182]:
# list of columns with missing data
list_of_columns_with_missing_data = list(azdias_cleaned_encoded_NA_report[azdias_cleaned_encoded_NA_report['missing_NA'] != 0]['Column'])
list_of_columns_with_missing_data
Out[182]:
['ALTERSKATEGORIE_GROB',
 'HEALTH_TYP',
 'RETOURTYP_BK_S',
 'VERS_TYP',
 'W_KEIT_KIND_HH',
 'ANZ_HAUSHALTE_AKTIV',
 'ANZ_HH_TITEL',
 'KONSUMNAEHE',
 'KBA05_ANTG1',
 'KBA05_ANTG2',
 'KBA05_ANTG3',
 'KBA05_ANTG4',
 'KBA05_GBZ',
 'BALLRAUM',
 'EWDICHTE',
 'INNENSTADT',
 'GEBAEUDETYP_RASTER',
 'KKK',
 'MOBI_REGIO',
 'ONLINE_AFFINITAET',
 'REGIOTYP',
 'KBA13_ANZAHL_PKW',
 'PLZ8_ANTG1',
 'PLZ8_ANTG2',
 'PLZ8_ANTG3',
 'PLZ8_ANTG4',
 'PLZ8_HHZ',
 'PLZ8_GBZ',
 'ARBEIT',
 'ORTSGR_KLS9',
 'RELAT_AB',
 'CAMEO_INTL_2015_WEALTH',
 'CAMEO_INTL_2015_LIFE_STAGE_TYP',
 'PLZ8_BAUMAX_FAMILY_HOMES']
In [183]:
# the purpose of the below is to undersntannd how best to impute missing values.
# bar graphs for columns with missing data
# look and see which are good candidates for imputation with most frequent value. Avg?
bar_graph_for_each_column(azdias_cleaned_encoded, list_of_columns_with_missing_data)
ALTERSKATEGORIE_GROB
     counts  percentage
3.0  310466   39.039363
4.0  223265   28.074325
2.0  137100   17.239558
1.0  124433   15.646754
HEALTH_TYP
     counts  percentage
3.0  307926   40.445214
2.0  297008   39.011166
1.0  156407   20.543620
RETOURTYP_BK_S
     counts  percentage
5.0  282167   35.567956
3.0  174283   21.968870
4.0  123422   15.557696
1.0  122787   15.477652
2.0   90659   11.427826
VERS_TYP
     counts  percentage
2.0  394238   51.782053
1.0  367103   48.217947
W_KEIT_KIND_HH
     counts  percentage
6.0  281963   38.166960
4.0  128673   17.417382
3.0   99515   13.470509
2.0   82043   11.105471
1.0   81853   11.079752
5.0   64715    8.759925
ANZ_HAUSHALTE_AKTIV
       counts  percentage
1.0    195955   24.754139
2.0    120980   15.282875
3.0     62575    7.904826
4.0     43213    5.458909
5.0     37815    4.777004
6.0     36020    4.550249
7.0     34526    4.361519
8.0     32293    4.079434
9.0     29002    3.663696
10.0    25428    3.212208
11.0    21964    2.774616
12.0    18033    2.278030
13.0    15282    1.930508
14.0    12625    1.594861
15.0    10371    1.310123
16.0     8899    1.124172
17.0     7292    0.921166
18.0     6324    0.798883
19.0     5461    0.689864
20.0     4674    0.590446
21.0     4138    0.522735
22.0     3735    0.471826
23.0     3243    0.409674
24.0     2838    0.358512
25.0     2636    0.332994
26.0     2342    0.295855
27.0     2232    0.281959
28.0     2040    0.257704
29.0     1963    0.247977
30.0     1821    0.230039
31.0     1634    0.206416
32.0     1616    0.204142
33.0     1551    0.195931
34.0     1437    0.181530
35.0     1320    0.166750
37.0     1317    0.166371
36.0     1281    0.161823
38.0     1255    0.158539
40.0     1115    0.140853
39.0     1113    0.140600
42.0     1043    0.131758
41.0     1031    0.130242
44.0      829    0.104724
43.0      819    0.103461
46.0      771    0.097397
45.0      686    0.086659
47.0      665    0.084007
48.0      662    0.083628
49.0      609    0.076932
50.0      548    0.069226
52.0      538    0.067963
51.0      474    0.059878
55.0      436    0.055078
54.0      436    0.055078
53.0      427    0.053941
56.0      399    0.050404
58.0      394    0.049772
57.0      376    0.047498
61.0      371    0.046867
59.0      346    0.043709
64.0      312    0.039414
62.0      308    0.038908
60.0      295    0.037266
63.0      291    0.036761
67.0      248    0.031329
70.0      248    0.031329
68.0      246    0.031076
73.0      245    0.030950
66.0      244    0.030823
72.0      226    0.028550
65.0      201    0.025391
69.0      192    0.024255
71.0      188    0.023749
77.0      182    0.022991
74.0      171    0.021602
75.0      171    0.021602
80.0      150    0.018949
76.0      149    0.018823
82.0      138    0.017433
79.0      136    0.017180
84.0      131    0.016549
85.0      131    0.016549
81.0      130    0.016422
78.0      128    0.016170
83.0      123    0.015538
91.0      123    0.015538
86.0      122    0.015412
87.0      119    0.015033
89.0      117    0.014780
93.0      116    0.014654
92.0      114    0.014401
88.0      107    0.013517
90.0      101    0.012759
97.0       99    0.012506
102.0      96    0.012127
98.0       91    0.011496
95.0       91    0.011496
103.0      90    0.011369
99.0       86    0.010864
110.0      84    0.010611
...       ...         ...
163.0      11    0.001390
348.0      11    0.001390
347.0      11    0.001390
194.0      10    0.001263
184.0      10    0.001263
218.0      10    0.001263
190.0      10    0.001263
258.0      10    0.001263
236.0      10    0.001263
151.0       9    0.001137
367.0       9    0.001137
172.0       9    0.001137
189.0       9    0.001137
274.0       9    0.001137
438.0       9    0.001137
277.0       9    0.001137
221.0       9    0.001137
209.0       9    0.001137
344.0       9    0.001137
219.0       9    0.001137
314.0       9    0.001137
185.0       8    0.001011
200.0       8    0.001011
207.0       8    0.001011
177.0       8    0.001011
321.0       8    0.001011
252.0       8    0.001011
318.0       8    0.001011
595.0       8    0.001011
333.0       8    0.001011
169.0       8    0.001011
286.0       7    0.000884
201.0       7    0.000884
242.0       7    0.000884
211.0       7    0.000884
241.0       7    0.000884
353.0       7    0.000884
191.0       7    0.000884
328.0       7    0.000884
267.0       7    0.000884
208.0       7    0.000884
346.0       7    0.000884
445.0       7    0.000884
206.0       7    0.000884
276.0       7    0.000884
259.0       6    0.000758
317.0       6    0.000758
283.0       6    0.000758
290.0       6    0.000758
204.0       6    0.000758
379.0       6    0.000758
243.0       6    0.000758
166.0       6    0.000758
304.0       6    0.000758
202.0       6    0.000758
430.0       6    0.000758
214.0       6    0.000758
193.0       6    0.000758
215.0       6    0.000758
377.0       5    0.000632
205.0       5    0.000632
231.0       5    0.000632
263.0       5    0.000632
229.0       5    0.000632
280.0       5    0.000632
240.0       5    0.000632
197.0       5    0.000632
228.0       5    0.000632
316.0       5    0.000632
331.0       4    0.000505
326.0       4    0.000505
250.0       4    0.000505
266.0       4    0.000505
174.0       4    0.000505
260.0       4    0.000505
256.0       4    0.000505
523.0       4    0.000505
255.0       4    0.000505
301.0       4    0.000505
249.0       4    0.000505
515.0       4    0.000505
285.0       4    0.000505
293.0       3    0.000379
224.0       3    0.000379
414.0       3    0.000379
168.0       3    0.000379
272.0       3    0.000379
395.0       3    0.000379
307.0       3    0.000379
244.0       3    0.000379
226.0       3    0.000379
378.0       3    0.000379
404.0       2    0.000253
237.0       2    0.000253
254.0       2    0.000253
213.0       2    0.000253
366.0       1    0.000126
232.0       1    0.000126
536.0       1    0.000126
220.0       1    0.000126

[291 rows x 2 columns]
ANZ_HH_TITEL
      counts  percentage
0.0   770239   96.982025
1.0    20157    2.538000
2.0     2459    0.309617
3.0      585    0.073658
4.0      232    0.029211
5.0      117    0.014732
6.0      106    0.013347
8.0       68    0.008562
7.0       65    0.008184
9.0       34    0.004281
13.0      29    0.003651
12.0      22    0.002770
11.0      22    0.002770
14.0      16    0.002015
10.0      16    0.002015
17.0      13    0.001637
20.0       9    0.001133
15.0       7    0.000881
18.0       6    0.000755
16.0       3    0.000378
23.0       3    0.000378
KONSUMNAEHE
     counts  percentage
1.0  188458   23.616439
3.0  166797   20.902011
5.0  150940   18.914905
2.0  131327   16.457121
4.0  130330   16.332183
6.0   26023    3.261048
7.0    4120    0.516294
KBA05_ANTG1
     counts  percentage
0.0  261049   34.443862
1.0  161224   21.272548
2.0  126725   16.720610
3.0  117762   15.537995
4.0   91137   12.024985
KBA05_ANTG2
     counts  percentage
0.0  292538   38.598649
1.0  163751   21.605970
2.0  138273   18.244300
3.0  134455   17.740537
4.0   28880    3.810544
KBA05_ANTG3
     counts  percentage
0.0  511545   67.495319
1.0   92748   12.237547
2.0   80234   10.586399
3.0   73370    9.680735
KBA05_ANTG4
     counts  percentage
0.0  600171   79.188993
1.0   83591   11.029335
2.0   74135    9.781672
KBA05_GBZ
     counts  percentage
3.0  197833   26.102887
5.0  158971   20.975278
4.0  155301   20.491043
2.0  138528   18.277945
1.0  107264   14.152847
BALLRAUM
     counts  percentage
6.0  255090   31.987210
1.0  151781   19.032697
2.0  104521   13.106492
7.0   99039   12.419073
3.0   73276    9.188501
4.0   61357    7.693909
5.0   52411    6.572118
EWDICHTE
     counts  percentage
6.0  201009   25.205680
5.0  161208   20.214803
2.0  139087   17.440923
4.0  130716   16.391235
1.0   84047   10.539139
3.0   81408   10.208220
INNENSTADT
     counts  percentage
5.0  147624   18.511427
4.0  134067   16.811436
6.0  111678   14.003950
2.0  109048   13.674159
3.0   92817   11.638860
8.0   82868   10.391298
7.0   67463    8.459576
1.0   51910    6.509295
GEBAEUDETYP_RASTER
     counts  percentage
4.0  359618   45.061524
3.0  205329   25.728517
5.0  159215   19.950254
2.0   58961    7.388041
1.0   14937    1.871664
KKK
     counts  percentage
3.0  273024   37.239500
2.0  181519   24.758544
4.0  178648   24.366950
1.0   99966   13.635006
MOBI_REGIO
     counts  percentage
1.0  163993   21.637901
3.0  150336   19.835941
5.0  148713   19.621796
4.0  148209   19.555296
2.0  146305   19.304074
6.0     341    0.044993
ONLINE_AFFINITAET
     counts  percentage
4.0  154959   19.533025
3.0  153679   19.371677
1.0  148126   18.671705
2.0  143459   18.083417
5.0  130440   16.442335
0.0   62655    7.897842
REGIOTYP
     counts  percentage
6.0  195286   26.636314
5.0  145359   19.826449
3.0   93929   12.811581
2.0   91662   12.502370
7.0   83943   11.449526
4.0   68180    9.299509
1.0   54798    7.474252
KBA13_ANZAHL_PKW
        counts  percentage
1400.0   11722    1.492450
1500.0    8291    1.055614
1300.0    6427    0.818288
1600.0    6135    0.781111
1700.0    3795    0.483181
1800.0    2617    0.333198
464.0     1604    0.204222
417.0     1604    0.204222
519.0     1600    0.203713
534.0     1496    0.190471
386.0     1458    0.185633
1900.0    1450    0.184615
395.0     1446    0.184105
481.0     1417    0.180413
455.0     1409    0.179394
483.0     1393    0.177357
452.0     1388    0.176721
418.0     1384    0.176211
454.0     1380    0.175702
450.0     1380    0.175702
494.0     1379    0.175575
459.0     1379    0.175575
492.0     1359    0.173028
504.0     1340    0.170609
387.0     1338    0.170355
420.0     1337    0.170227
439.0     1327    0.168954
506.0     1326    0.168827
388.0     1324    0.168572
456.0     1323    0.168445
487.0     1319    0.167936
402.0     1318    0.167808
421.0     1317    0.167681
499.0     1310    0.166790
466.0     1308    0.166535
491.0     1302    0.165771
490.0     1302    0.165771
558.0     1301    0.165644
477.0     1298    0.165262
489.0     1296    0.165007
567.0     1292    0.164498
536.0     1290    0.164243
406.0     1288    0.163989
516.0     1284    0.163479
393.0     1282    0.163225
453.0     1282    0.163225
389.0     1281    0.163097
390.0     1280    0.162970
556.0     1278    0.162715
584.0     1278    0.162715
438.0     1276    0.162461
574.0     1274    0.162206
485.0     1273    0.162079
537.0     1267    0.161315
517.0     1266    0.161188
479.0     1264    0.160933
508.0     1262    0.160678
497.0     1262    0.160678
377.0     1257    0.160042
478.0     1256    0.159914
500.0     1255    0.159787
352.0     1254    0.159660
572.0     1254    0.159660
375.0     1254    0.159660
467.0     1254    0.159660
446.0     1253    0.159532
409.0     1252    0.159405
451.0     1245    0.158514
429.0     1245    0.158514
384.0     1243    0.158259
515.0     1242    0.158132
426.0     1241    0.158005
434.0     1240    0.157877
518.0     1239    0.157750
470.0     1232    0.156859
554.0     1232    0.156859
488.0     1232    0.156859
471.0     1229    0.156477
369.0     1228    0.156349
382.0     1225    0.155968
360.0     1224    0.155840
442.0     1223    0.155713
410.0     1221    0.155458
502.0     1220    0.155331
430.0     1220    0.155331
399.0     1218    0.155076
396.0     1217    0.154949
345.0     1215    0.154694
509.0     1214    0.154567
428.0     1213    0.154440
565.0     1212    0.154312
380.0     1207    0.153676
475.0     1205    0.153421
597.0     1203    0.153166
549.0     1201    0.152912
412.0     1201    0.152912
469.0     1200    0.152784
530.0     1200    0.152784
457.0     1200    0.152784
437.0     1198    0.152530
...        ...         ...
1168.0     113    0.014387
1177.0     112    0.014260
76.0       111    0.014133
1198.0     107    0.013623
88.0       107    0.013623
1074.0     103    0.013114
1221.0     103    0.013114
69.0       101    0.012859
71.0        99    0.012605
1133.0      98    0.012477
100.0       97    0.012350
64.0        95    0.012095
1224.0      95    0.012095
80.0        94    0.011968
70.0        91    0.011586
73.0        89    0.011332
74.0        89    0.011332
1233.0      87    0.011077
1122.0      85    0.010822
1213.0      85    0.010822
75.0        85    0.010822
51.0        84    0.010695
78.0        84    0.010695
68.0        84    0.010695
63.0        80    0.010186
1093.0      78    0.009931
47.0        77    0.009804
59.0        76    0.009676
1247.0      76    0.009676
52.0        75    0.009549
1184.0      75    0.009549
77.0        74    0.009422
72.0        74    0.009422
1115.0      73    0.009294
1185.0      71    0.009040
66.0        69    0.008785
1098.0      67    0.008530
65.0        67    0.008530
62.0        66    0.008403
67.0        66    0.008403
0.0         62    0.007894
1225.0      61    0.007767
58.0        60    0.007639
45.0        59    0.007512
61.0        58    0.007385
1232.0      57    0.007257
60.0        54    0.006875
56.0        53    0.006748
55.0        53    0.006748
53.0        53    0.006748
48.0        53    0.006748
44.0        52    0.006621
54.0        50    0.006366
46.0        46    0.005857
57.0        46    0.005857
37.0        45    0.005729
41.0        44    0.005602
35.0        43    0.005475
34.0        42    0.005347
40.0        42    0.005347
36.0        38    0.004838
50.0        37    0.004711
42.0        35    0.004456
38.0        35    0.004456
33.0        31    0.003947
31.0        29    0.003692
39.0        28    0.003565
32.0        28    0.003565
49.0        27    0.003438
43.0        27    0.003438
28.0        24    0.003056
27.0        24    0.003056
25.0        23    0.002928
24.0        22    0.002801
26.0        21    0.002674
18.0        21    0.002674
17.0        20    0.002546
20.0        18    0.002292
21.0        17    0.002164
22.0        16    0.002037
12.0        16    0.002037
14.0        16    0.002037
29.0        15    0.001910
15.0        14    0.001782
23.0        13    0.001655
30.0        12    0.001528
16.0        11    0.001401
19.0        11    0.001401
13.0        10    0.001273
1.0          8    0.001019
10.0         8    0.001019
11.0         7    0.000891
5.0          7    0.000891
9.0          7    0.000891
4.0          7    0.000891
3.0          6    0.000764
8.0          6    0.000764
2.0          6    0.000764
7.0          5    0.000637
6.0          5    0.000637

[1261 rows x 2 columns]
PLZ8_ANTG1
     counts  percentage
2.0  270590   34.928089
3.0  222355   28.701856
1.0  189247   24.428235
4.0   87044   11.235746
0.0    5470    0.706074
PLZ8_ANTG2
     counts  percentage
3.0  307283   39.664466
2.0  215767   27.851469
4.0  191005   24.655160
1.0   53213    6.868799
0.0    7438    0.960106
PLZ8_ANTG3
     counts  percentage
2.0  252994   32.656776
1.0  237878   30.705584
3.0  164040   21.174484
0.0  119794   15.463156
PLZ8_ANTG4
     counts  percentage
0.0  356389   46.003129
1.0  294986   38.077154
2.0  123331   15.919717
PLZ8_HHZ
     counts  percentage
3.0  309146   39.904945
4.0  211911   27.353732
5.0  175813   22.694158
2.0   66891    8.634372
1.0   10945    1.412794
PLZ8_GBZ
     counts  percentage
3.0  288383   37.224831
4.0  180252   23.267149
5.0  153883   19.863406
2.0  111588   14.403916
1.0   40600    5.240698
ARBEIT
     counts  percentage
4.0  311337   39.219112
3.0  254987   32.120704
2.0  135661   17.089212
1.0   56766    7.150811
5.0   35089    4.420160
ORTSGR_KLS9
     counts  percentage
5.0  148095   18.653149
4.0  114909   14.473242
7.0  102866   12.956378
9.0   91878   11.572396
3.0   83539   10.522067
6.0   75995    9.571870
8.0   72709    9.157985
2.0   63361    7.980568
1.0   40589    5.112345
RELAT_AB
     counts  percentage
3.0  274005   34.516401
5.0  174963   22.040084
1.0  142906   18.001864
2.0  104846   13.207447
4.0   97120   12.234203
CAMEO_INTL_2015_WEALTH
   counts  percentage
5  223582   28.234828
2  190689   24.080968
4  189960   23.988907
1  119442   15.083613
3   68193    8.611684
CAMEO_INTL_2015_LIFE_STAGE_TYP
   counts  percentage
1  245054   30.946397
4  232777   29.396009
3  119692   15.115184
5  117044   14.780784
2   77299    9.761626
PLZ8_BAUMAX_FAMILY_HOMES
     counts  percentage
1.0  499550   64.482526
0.0   97333   12.563863
2.0   70407    9.088222
4.0   56684    7.316840
3.0   50732    6.548549

ANZ_HH_TITEL, 0 should have been replaced by NAN, but the feature into doc didn't say anything about 0s here (?)

In [184]:
azdias_cleaned_encoded.head()
Out[184]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
1 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 5.0 4.0 3.0 5.0 4.0 0 90 0 1 5 1 0 1 1.0 0 4.0
2 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 4.0 4.0 3.0 5.0 2.0 0 90 1 0 2 4 0 1 1.0 0 2.0
3 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 NaN 9.0 1.0 0.0 4.0 1997.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 NaN 4.0 1.0 NaN 596.0 2.0 2.0 2.0 0.0 3.0 4.0 2.0 3.0 3.0 0 70 0 1 1 2 0 1 1.0 1 0.0
4 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 3.0 3.0 4.0 6.0 5.0 0 70 0 1 4 3 0 1 2.0 0 3.0
5 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0 50 0 1 5 4 0 1 1.0 1 0.0
In [185]:
# Imputer produces an array, not a dataframe. I'll lose my columns I'll need to rebuild the df. 
# So I need to save my column names
columns_list = list(azdias_cleaned_encoded.columns)
print(type(columns_list))
columns_list
<class 'list'>
Out[185]:
['ALTERSKATEGORIE_GROB',
 'ANREDE_KZ',
 'FINANZ_MINIMALIST',
 'FINANZ_SPARER',
 'FINANZ_VORSORGER',
 'FINANZ_ANLEGER',
 'FINANZ_UNAUFFAELLIGER',
 'FINANZ_HAUSBAUER',
 'GREEN_AVANTGARDE',
 'HEALTH_TYP',
 'RETOURTYP_BK_S',
 'SEMIO_SOZ',
 'SEMIO_FAM',
 'SEMIO_REL',
 'SEMIO_MAT',
 'SEMIO_VERT',
 'SEMIO_LUST',
 'SEMIO_ERL',
 'SEMIO_KULT',
 'SEMIO_RAT',
 'SEMIO_KRIT',
 'SEMIO_DOM',
 'SEMIO_KAEM',
 'SEMIO_PFLICHT',
 'SEMIO_TRADV',
 'SOHO_KZ',
 'VERS_TYP',
 'ANZ_PERSONEN',
 'ANZ_TITEL',
 'HH_EINKOMMEN_SCORE',
 'W_KEIT_KIND_HH',
 'WOHNDAUER_2008',
 'ANZ_HAUSHALTE_AKTIV',
 'ANZ_HH_TITEL',
 'KONSUMNAEHE',
 'MIN_GEBAEUDEJAHR',
 'KBA05_ANTG1',
 'KBA05_ANTG2',
 'KBA05_ANTG3',
 'KBA05_ANTG4',
 'KBA05_GBZ',
 'BALLRAUM',
 'EWDICHTE',
 'INNENSTADT',
 'GEBAEUDETYP_RASTER',
 'KKK',
 'MOBI_REGIO',
 'ONLINE_AFFINITAET',
 'REGIOTYP',
 'KBA13_ANZAHL_PKW',
 'PLZ8_ANTG1',
 'PLZ8_ANTG2',
 'PLZ8_ANTG3',
 'PLZ8_ANTG4',
 'PLZ8_HHZ',
 'PLZ8_GBZ',
 'ARBEIT',
 'ORTSGR_KLS9',
 'RELAT_AB',
 'OST_WEST_KZ_O',
 'PRAEGENDE_JUGENDJAHRE_DECADE',
 'PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE',
 'PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM',
 'CAMEO_INTL_2015_WEALTH',
 'CAMEO_INTL_2015_LIFE_STAGE_TYP',
 'PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS',
 'PLZ8_BAUMAX_BLDNG_TYPE_FAMILY',
 'PLZ8_BAUMAX_FAMILY_HOMES',
 'WOHNLAGE_RURAL_FLAG',
 'WOHNLAGE_CITY_NEIGHBOURHOOD']
In [186]:
"""
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
https://stackoverflow.com/questions/25017626/predicting-missing-values-with-scikit-learns-imputer-module
https://stackoverflow.com/questions/52384806/imputer-on-some-columns-in-a-dataframe
"""

imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)

azdias_cleaned_encoded_imputed = imputer.fit_transform(azdias_cleaned_encoded)

print(type(azdias_cleaned_encoded_imputed))

azdias_cleaned_encoded_imputed = pd.DataFrame(azdias_cleaned_encoded_imputed,columns=columns_list)

print(type(azdias_cleaned_encoded_imputed))

azdias_cleaned_encoded_imputed.head()
<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
Out[186]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
0 1.0 2.0 1.0 5.0 2.0 5.0 4.0 5.0 0.0 3.0 1.0 5.0 4.0 4.0 3.0 1.0 2.0 2.0 3.0 6.0 4.0 7.0 4.0 7.0 6.0 1.0 2.0 2.0 0.0 6.0 3.0 9.0 11.0 0.0 1.0 1992.0 0.0 0.0 0.0 2.0 1.0 6.0 3.0 8.0 3.0 2.0 1.0 3.0 3.0 963.0 2.0 3.0 2.0 1.0 5.0 4.0 3.0 5.0 4.0 0.0 90.0 0.0 1.0 5.0 1.0 0.0 1.0 1.0 0.0 4.0
1 3.0 2.0 1.0 4.0 1.0 2.0 3.0 5.0 1.0 3.0 3.0 4.0 1.0 3.0 3.0 4.0 4.0 6.0 3.0 4.0 7.0 7.0 7.0 3.0 3.0 0.0 1.0 1.0 0.0 4.0 3.0 9.0 10.0 0.0 5.0 1992.0 1.0 3.0 1.0 0.0 3.0 2.0 4.0 4.0 4.0 2.0 3.0 2.0 2.0 712.0 3.0 3.0 1.0 0.0 4.0 4.0 3.0 5.0 2.0 0.0 90.0 1.0 0.0 2.0 4.0 0.0 1.0 1.0 0.0 2.0
2 4.0 2.0 4.0 2.0 5.0 2.0 1.0 2.0 0.0 2.0 2.0 5.0 1.0 2.0 1.0 4.0 4.0 7.0 4.0 3.0 4.0 4.0 5.0 4.0 4.0 0.0 1.0 0.0 0.0 1.0 6.0 9.0 1.0 0.0 4.0 1997.0 4.0 1.0 0.0 0.0 4.0 4.0 2.0 6.0 4.0 3.0 4.0 1.0 6.0 596.0 2.0 2.0 2.0 0.0 3.0 4.0 2.0 3.0 3.0 0.0 70.0 0.0 1.0 1.0 2.0 0.0 1.0 1.0 1.0 0.0
3 3.0 1.0 4.0 3.0 4.0 1.0 3.0 2.0 0.0 3.0 5.0 6.0 4.0 4.0 2.0 7.0 4.0 4.0 6.0 2.0 3.0 2.0 2.0 4.0 2.0 0.0 2.0 4.0 0.0 5.0 2.0 9.0 3.0 0.0 4.0 1992.0 1.0 4.0 1.0 0.0 3.0 2.0 5.0 1.0 5.0 3.0 3.0 5.0 5.0 435.0 2.0 4.0 2.0 1.0 3.0 3.0 4.0 6.0 5.0 0.0 70.0 0.0 1.0 4.0 3.0 0.0 1.0 2.0 0.0 3.0
4 1.0 2.0 3.0 1.0 5.0 2.0 2.0 5.0 0.0 3.0 3.0 2.0 4.0 7.0 4.0 2.0 2.0 2.0 5.0 7.0 4.0 4.0 4.0 7.0 6.0 0.0 2.0 1.0 0.0 5.0 6.0 9.0 5.0 0.0 5.0 1992.0 2.0 2.0 0.0 0.0 4.0 6.0 2.0 7.0 4.0 4.0 4.0 1.0 5.0 1300.0 2.0 3.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0 0.0 50.0 0.0 1.0 5.0 4.0 0.0 1.0 1.0 1.0 0.0
In [187]:
# number of cols with nan
azdias_cleaned_encoded_imputed.isnull().any().sum()  
Out[187]:
0
In [188]:
# we now should have no NANs
azdias_cleaned_encoded_imputed.isnull().sum()  
Out[188]:
ALTERSKATEGORIE_GROB                         0
ANREDE_KZ                                    0
FINANZ_MINIMALIST                            0
FINANZ_SPARER                                0
FINANZ_VORSORGER                             0
FINANZ_ANLEGER                               0
FINANZ_UNAUFFAELLIGER                        0
FINANZ_HAUSBAUER                             0
GREEN_AVANTGARDE                             0
HEALTH_TYP                                   0
RETOURTYP_BK_S                               0
SEMIO_SOZ                                    0
SEMIO_FAM                                    0
SEMIO_REL                                    0
SEMIO_MAT                                    0
SEMIO_VERT                                   0
SEMIO_LUST                                   0
SEMIO_ERL                                    0
SEMIO_KULT                                   0
SEMIO_RAT                                    0
SEMIO_KRIT                                   0
SEMIO_DOM                                    0
SEMIO_KAEM                                   0
SEMIO_PFLICHT                                0
SEMIO_TRADV                                  0
SOHO_KZ                                      0
VERS_TYP                                     0
ANZ_PERSONEN                                 0
ANZ_TITEL                                    0
HH_EINKOMMEN_SCORE                           0
W_KEIT_KIND_HH                               0
WOHNDAUER_2008                               0
ANZ_HAUSHALTE_AKTIV                          0
ANZ_HH_TITEL                                 0
KONSUMNAEHE                                  0
MIN_GEBAEUDEJAHR                             0
KBA05_ANTG1                                  0
KBA05_ANTG2                                  0
KBA05_ANTG3                                  0
KBA05_ANTG4                                  0
KBA05_GBZ                                    0
BALLRAUM                                     0
EWDICHTE                                     0
INNENSTADT                                   0
GEBAEUDETYP_RASTER                           0
KKK                                          0
MOBI_REGIO                                   0
ONLINE_AFFINITAET                            0
REGIOTYP                                     0
KBA13_ANZAHL_PKW                             0
PLZ8_ANTG1                                   0
PLZ8_ANTG2                                   0
PLZ8_ANTG3                                   0
PLZ8_ANTG4                                   0
PLZ8_HHZ                                     0
PLZ8_GBZ                                     0
ARBEIT                                       0
ORTSGR_KLS9                                  0
RELAT_AB                                     0
OST_WEST_KZ_O                                0
PRAEGENDE_JUGENDJAHRE_DECADE                 0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE    0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM    0
CAMEO_INTL_2015_WEALTH                       0
CAMEO_INTL_2015_LIFE_STAGE_TYP               0
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS              0
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY                0
PLZ8_BAUMAX_FAMILY_HOMES                     0
WOHNLAGE_RURAL_FLAG                          0
WOHNLAGE_CITY_NEIGHBOURHOOD                  0
dtype: int64
In [189]:
# checking how mode imputation worked

print(azdias_cleaned_encoded['KKK'].isna().sum())
print(azdias_cleaned_encoded['KKK'].value_counts())

print(azdias_cleaned_encoded_imputed['KKK'].isna().sum())
print(azdias_cleaned_encoded_imputed['KKK'].value_counts())
64910
3.0    273024
2.0    181519
4.0    178648
1.0     99966
Name: KKK, dtype: int64
0
3.0    337934
2.0    181519
4.0    178648
1.0     99966
Name: KKK, dtype: int64
In [190]:
# KKK
64910 + 273024
Out[190]:
337934
In [191]:
len(list_of_columns_with_missing_data)
Out[191]:
34
In [192]:
# checking how mode imputation worked
# NAN count + mode frequency in the initial dataset = mode frequency in IMPUTED dataset

x = 10 # give a number from 0 to 33. This will indicate which col with missing value you're checking

test = list_of_columns_with_missing_data[x]

print("checking column:")
print(test)
print()


print('Before imputation:')
print('NAN', azdias_cleaned_encoded[test].isna().sum())
print(azdias_cleaned_encoded[test].value_counts())
print()

print('After imputation:')
print('NAN', azdias_cleaned_encoded_imputed[test].isna().sum())
print(azdias_cleaned_encoded_imputed[test].value_counts())

print()
if azdias_cleaned_encoded[test].isna().sum() + max(azdias_cleaned_encoded[test].value_counts())== max(azdias_cleaned_encoded_imputed[test].value_counts()):
    print('All good :) ')
    print(azdias_cleaned_encoded[test].isna().sum(), '+', max(azdias_cleaned_encoded[test].value_counts()), '==', max(azdias_cleaned_encoded_imputed[test].value_counts()))
else:
    print('Smth is wrong :( )')
checking column:
KBA05_ANTG3

Before imputation:
NAN 40170
0.0    511545
1.0     92748
2.0     80234
3.0     73370
Name: KBA05_ANTG3, dtype: int64

After imputation:
NAN 0
0.0    551715
1.0     92748
2.0     80234
3.0     73370
Name: KBA05_ANTG3, dtype: int64

All good :) 
40170 + 511545 == 551715
In [193]:
# PLZ8_BAUMAX_FAMILY_HOMES
23361 + 499550
Out[193]:
522911
In [194]:
# ALTERSKATEGORIE_GROB
2803 + 310466
Out[194]:
313269

Apply feature scaling to the general population demographics data.

image.png

In [195]:
"""
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
https://stackoverflow.com/questions/40758562/can-anyone-explain-me-standardscaler
https://stackoverflow.com/questions/24645153/pandas-dataframe-columns-scaling-with-sklearn
"""

scaler = StandardScaler()

azdias_scaled = scaler.fit_transform(azdias_cleaned_encoded_imputed)

print(type(azdias_scaled))

azdias_scaled = pd.DataFrame(azdias_scaled, columns=columns_list)

print(type(azdias_scaled))

azdias_scaled.head()
<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
Out[195]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
0 -1.766647 0.957912 -1.494594 1.537920 -1.040686 1.465965 0.958633 1.339319 -0.530407 1.010156 -1.685452 0.443205 -0.059355 0.002834 -0.463909 -1.684507 -1.109913 -1.435372 -0.578164 1.274185 -0.312196 1.339262 -0.157565 1.518699 1.288987 10.85417 0.922940 0.234458 -0.060408 1.026720 -0.730192 0.567332 0.173104 -0.125133 -1.304533 -0.383167 -1.008836 -0.965690 -0.594874 2.730674 -1.659274 0.845251 -0.547161 1.701105 -0.799742 -0.791629 -1.279712 0.166966 -0.882469 0.942501 -0.256528 0.211982 0.402977 0.442535 1.455855 0.574309 -0.171399 -0.127042 0.684885 -0.517425 0.976547 -0.530407 0.585967 1.175655 -1.248888 -0.372695 0.422113 -0.304992 -0.552785 0.977376
1 0.200522 0.957912 -1.494594 0.864560 -1.766972 -0.570999 0.244109 1.339319 1.885345 1.010156 -0.310902 -0.072013 -1.626994 -0.520587 -0.463909 -0.142554 -0.158741 0.754262 -0.578164 0.064233 1.391992 1.339262 1.448745 -0.638942 -0.410210 -0.09213 -1.083494 -0.630198 -0.060408 -0.267573 -0.730192 0.567332 0.109101 -0.125133 1.274844 -0.383167 -0.297919 1.444756 0.408133 -0.464084 -0.116192 -0.986687 0.034482 -0.271086 0.283466 -0.791629 0.092849 -0.476480 -1.435271 0.227651 0.786631 0.211982 -0.623111 -0.934791 0.419746 0.574309 -0.171399 -0.127042 -0.789025 -0.517425 0.976547 1.885345 -1.706581 -0.869682 0.767097 -0.372695 0.422113 -0.304992 -0.552785 -0.238713
2 1.184107 0.957912 0.683145 -0.482159 1.138172 -0.570999 -1.184938 -0.791197 -0.530407 -0.311824 -0.998177 0.443205 -1.626994 -1.044008 -1.509281 -0.142554 -0.158741 1.301671 -0.067376 -0.540743 -0.312196 -0.303542 0.377872 -0.099532 0.156189 -0.09213 -1.083494 -1.494855 -0.060408 -2.209012 0.957029 0.567332 -0.466933 -0.125133 0.630000 1.117198 1.834830 -0.162208 -0.594874 -0.464084 0.655349 -0.070718 -1.128805 0.715010 0.283466 0.269703 0.779129 -1.119927 0.775936 -0.102718 -0.256528 -0.890125 0.402977 -0.934791 -0.616363 0.574309 -1.173149 -0.997550 -0.052070 -0.517425 -0.034268 -0.530407 0.585967 -1.551461 -0.576893 -0.372695 0.422113 -0.304992 1.809022 -1.454802
3 0.200522 -1.043937 0.683145 0.191200 0.411886 -1.249987 0.244109 -0.791197 -0.530407 1.010156 1.063648 0.958423 -0.059355 0.002834 -0.986595 1.399399 -0.158741 -0.340555 0.954200 -1.145720 -0.880259 -1.398745 -1.228438 -0.099532 -0.976609 -0.09213 0.922940 1.963772 -0.060408 0.379574 -1.292599 0.567332 -0.338925 -0.125133 0.630000 -0.383167 -0.297919 2.248238 0.408133 -0.464084 -0.116192 -0.986687 0.616126 -1.750229 1.366673 0.269703 0.092849 1.453859 0.223134 -0.561248 -0.256528 1.314088 0.402977 0.442535 -0.616363 -0.337194 0.830351 0.308212 1.421840 -0.517425 -0.034268 -0.530407 0.585967 0.493876 0.095102 -0.372695 0.422113 0.690159 -0.552785 0.369332
4 -1.766647 0.957912 -0.042768 -1.155519 1.138172 -0.570999 -0.470414 1.339319 -0.530407 1.010156 -0.310902 -1.102449 -0.059355 1.573097 0.058776 -1.170522 -1.109913 -1.435372 0.443412 1.879161 -0.312196 -0.303542 -0.157565 1.518699 1.288987 -0.09213 0.922940 -0.630198 -0.060408 0.379574 0.957029 0.567332 -0.210918 -0.125133 1.274844 -0.383167 0.412997 0.641274 -0.594874 -0.464084 0.655349 0.845251 -1.128805 1.208058 0.283466 1.331035 0.779129 -1.119927 0.223134 1.902281 -0.256528 0.211982 -0.623111 0.442535 1.455855 1.485812 -1.173149 -0.997550 -0.052070 -0.517425 -1.045083 -0.530407 0.585967 1.175655 0.767097 -0.372695 0.422113 -0.304992 1.809022 -1.454802
In [196]:
azdias_cleaned_encoded_imputed.describe().transpose()
Out[196]:
count mean std min 25% 50% 75% max
ALTERSKATEGORIE_GROB 798067.0 2.796131 1.016690 1.0 2.0 3.0 4.0 4.0
ANREDE_KZ 798067.0 1.521486 0.499538 1.0 1.0 2.0 2.0 2.0
FINANZ_MINIMALIST 798067.0 3.058916 1.377576 1.0 2.0 3.0 4.0 5.0
FINANZ_SPARER 798067.0 2.716050 1.485091 1.0 1.0 3.0 4.0 5.0
FINANZ_VORSORGER 798067.0 3.432887 1.376869 1.0 2.0 4.0 5.0 5.0
FINANZ_ANLEGER 798067.0 2.840956 1.472781 1.0 1.0 3.0 4.0 5.0
FINANZ_UNAUFFAELLIGER 798067.0 2.658361 1.399534 1.0 1.0 2.0 4.0 5.0
FINANZ_HAUSBAUER 798067.0 3.114092 1.408110 1.0 2.0 3.0 4.0 5.0
GREEN_AVANTGARDE 798067.0 0.219562 0.413950 0.0 0.0 0.0 0.0 1.0
HEALTH_TYP 798067.0 2.235876 0.756442 1.0 2.0 2.0 3.0 3.0
RETOURTYP_BK_S 798067.0 3.452369 1.455023 1.0 2.0 4.0 5.0 5.0
SEMIO_SOZ 798067.0 4.139771 1.940928 1.0 2.0 4.0 6.0 7.0
SEMIO_FAM 798067.0 4.113588 1.913707 1.0 2.0 4.0 6.0 7.0
SEMIO_REL 798067.0 3.994586 1.910509 1.0 3.0 4.0 5.0 7.0
SEMIO_MAT 798067.0 3.887550 1.913197 1.0 2.0 4.0 5.0 7.0
SEMIO_VERT 798067.0 4.277350 1.945586 1.0 2.0 5.0 6.0 7.0
SEMIO_LUST 798067.0 4.333779 2.102670 1.0 2.0 5.0 6.0 7.0
SEMIO_ERL 798067.0 4.622122 1.826790 1.0 3.0 4.0 6.0 7.0
SEMIO_KULT 798067.0 4.131906 1.957760 1.0 3.0 4.0 6.0 7.0
SEMIO_RAT 798067.0 3.893826 1.652959 1.0 3.0 4.0 5.0 7.0
SEMIO_KRIT 798067.0 4.549580 1.760370 1.0 3.0 5.0 6.0 7.0
SEMIO_DOM 798067.0 4.554312 1.826147 1.0 3.0 5.0 6.0 7.0
SEMIO_KAEM 798067.0 4.294274 1.867635 1.0 3.0 4.0 6.0 7.0
SEMIO_PFLICHT 798067.0 4.184520 1.853878 1.0 3.0 4.0 6.0 7.0
SEMIO_TRADV 798067.0 3.724242 1.765541 1.0 2.0 4.0 5.0 7.0
SOHO_KZ 798067.0 0.008417 0.091355 0.0 0.0 0.0 0.0 1.0
VERS_TYP 798067.0 1.540010 0.498397 1.0 1.0 2.0 2.0 2.0
ANZ_PERSONEN 798067.0 1.728842 1.156529 0.0 1.0 1.0 2.0 45.0
ANZ_TITEL 798067.0 0.004161 0.068887 0.0 0.0 0.0 0.0 6.0
HH_EINKOMMEN_SCORE 798067.0 4.413465 1.545246 1.0 3.0 5.0 6.0 6.0
W_KEIT_KIND_HH 798067.0 4.298335 1.778073 1.0 3.0 5.0 6.0 6.0
WOHNDAUER_2008 798067.0 7.908934 1.923155 1.0 8.0 9.0 9.0 9.0
ANZ_HAUSHALTE_AKTIV 798067.0 8.295401 15.624098 1.0 1.0 4.0 9.0 595.0
ANZ_HH_TITEL 798067.0 0.040450 0.323257 0.0 0.0 0.0 0.0 23.0
KONSUMNAEHE 798067.0 3.023021 1.550763 1.0 2.0 3.0 4.0 7.0
MIN_GEBAEUDEJAHR 798067.0 1993.276914 3.332524 1985.0 1992.0 1992.0 1993.0 2016.0
KBA05_ANTG1 798067.0 1.419064 1.406636 0.0 0.0 1.0 3.0 4.0
KBA05_ANTG2 798067.0 1.201882 1.244584 0.0 0.0 1.0 2.0 4.0
KBA05_ANTG3 798067.0 0.593091 0.997003 0.0 0.0 0.0 1.0 3.0
KBA05_ANTG4 798067.0 0.290528 0.626026 0.0 0.0 0.0 0.0 2.0
KBA05_GBZ 798067.0 3.150598 1.296109 1.0 2.0 3.0 4.0 5.0
BALLRAUM 798067.0 4.154412 2.183481 1.0 2.0 5.0 6.0 7.0
EWDICHTE 798067.0 3.940716 1.719267 1.0 2.0 4.0 6.0 6.0
INNENSTADT 798067.0 4.549816 2.028202 1.0 3.0 5.0 6.0 8.0
GEBAEUDETYP_RASTER 798067.0 3.738309 0.923185 1.0 3.0 4.0 4.0 5.0
KKK 798067.0 2.745882 0.942213 1.0 2.0 3.0 3.0 4.0
MOBI_REGIO 798067.0 2.864707 1.457131 1.0 1.0 3.0 4.0 6.0
ONLINE_AFFINITAET 798067.0 2.740513 1.554132 0.0 1.0 3.0 4.0 5.0
REGIOTYP 798067.0 4.596357 1.808968 1.0 3.0 5.0 6.0 7.0
KBA13_ANZAHL_PKW 798067.0 632.066753 351.122507 0.0 386.0 554.0 794.0 2300.0
PLZ8_ANTG1 798067.0 2.245914 0.958627 0.0 2.0 2.0 3.0 4.0
PLZ8_ANTG2 798067.0 2.807658 0.907354 0.0 2.0 3.0 3.0 4.0
PLZ8_ANTG3 798067.0 1.607269 0.974576 0.0 1.0 2.0 2.0 3.0
PLZ8_ANTG4 798067.0 0.678700 0.726045 0.0 0.0 1.0 1.0 2.0
PLZ8_HHZ 798067.0 3.594882 0.965150 1.0 3.0 3.0 4.0 5.0
PLZ8_GBZ 798067.0 3.369931 1.097090 1.0 3.0 3.0 4.0 5.0
ARBEIT 798067.0 3.171100 0.998254 1.0 3.0 3.0 4.0 5.0
ORTSGR_KLS9 798067.0 5.291880 2.297511 1.0 4.0 5.0 7.0 9.0
RELAT_AB 798067.0 3.070656 1.356935 1.0 2.0 3.0 4.0 5.0
OST_WEST_KZ_O 798067.0 0.211188 0.408152 0.0 0.0 0.0 0.0 1.0
PRAEGENDE_JUGENDJAHRE_DECADE 798067.0 70.678026 19.786022 0.0 60.0 70.0 90.0 90.0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE 798067.0 0.219562 0.413950 0.0 0.0 0.0 0.0 1.0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM 798067.0 0.744404 0.436196 0.0 0.0 1.0 1.0 1.0
CAMEO_INTL_2015_WEALTH 798067.0 3.275607 1.466752 1.0 2.0 4.0 5.0 5.0
CAMEO_INTL_2015_LIFE_STAGE_TYP 798067.0 2.858478 1.488107 1.0 1.0 3.0 4.0 5.0
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS 798067.0 0.121961 0.327241 0.0 0.0 0.0 0.0 1.0
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY 798067.0 0.848767 0.358276 0.0 1.0 1.0 1.0 1.0
PLZ8_BAUMAX_FAMILY_HOMES 798067.0 1.306478 1.004873 0.0 1.0 1.0 1.0 4.0
WOHNLAGE_RURAL_FLAG 798067.0 0.234052 0.423405 0.0 0.0 0.0 0.0 1.0
WOHNLAGE_CITY_NEIGHBOURHOOD 798067.0 2.392591 1.644617 0.0 1.0 3.0 4.0 5.0
In [197]:
# mean close to zero. Stdev is close to 1
azdias_scaled.describe().transpose()
Out[197]:
count mean std min 25% 50% 75% max
ALTERSKATEGORIE_GROB 798067.0 -1.320003e-16 1.000001 -1.766647 -0.783063 0.200522 1.184107 1.184107
ANREDE_KZ 798067.0 3.258161e-16 1.000001 -1.043937 -1.043937 0.957912 0.957912 0.957912
FINANZ_MINIMALIST 798067.0 5.657155e-17 1.000001 -1.494594 -0.768681 -0.042768 0.683145 1.409058
FINANZ_SPARER 798067.0 -1.007675e-16 1.000001 -1.155519 -1.155519 0.191200 0.864560 1.537920
FINANZ_VORSORGER 798067.0 1.164640e-16 1.000001 -1.766972 -1.040686 0.411886 1.138172 1.138172
FINANZ_ANLEGER 798067.0 1.377340e-17 1.000001 -1.249987 -1.249987 0.107989 0.786977 1.465965
FINANZ_UNAUFFAELLIGER 798067.0 4.154278e-17 1.000001 -1.184938 -1.184938 -0.470414 0.958633 1.673157
FINANZ_HAUSBAUER 798067.0 -1.895334e-16 1.000001 -1.501369 -0.791197 -0.081025 0.629147 1.339319
GREEN_AVANTGARDE 798067.0 -1.104899e-17 1.000001 -0.530407 -0.530407 -0.530407 -0.530407 1.885345
HEALTH_TYP 798067.0 2.394809e-16 1.000001 -1.633804 -0.311824 -0.311824 1.010156 1.010156
RETOURTYP_BK_S 798067.0 1.348849e-18 1.000001 -1.685452 -0.998177 0.376373 1.063648 1.063648
SEMIO_SOZ 798067.0 8.304550e-17 1.000001 -1.617667 -1.102449 -0.072013 0.958423 1.473641
SEMIO_FAM 798067.0 6.091636e-17 1.000001 -1.626994 -1.104448 -0.059355 0.985737 1.508284
SEMIO_REL 798067.0 1.940919e-18 1.000001 -1.567429 -0.520587 0.002834 0.526255 1.573097
SEMIO_MAT 798067.0 6.134371e-18 1.000001 -1.509281 -0.986595 0.058776 0.581462 1.626833
SEMIO_VERT 798067.0 2.115601e-16 1.000001 -1.684507 -1.170522 0.371431 0.885415 1.399399
SEMIO_LUST 798067.0 2.120498e-16 1.000001 -1.585499 -1.109913 0.316845 0.792432 1.268018
SEMIO_ERL 798067.0 1.769085e-16 1.000001 -1.982781 -0.887964 -0.340555 0.754262 1.301671
SEMIO_KULT 798067.0 -1.305891e-16 1.000001 -1.599741 -0.578164 -0.067376 0.954200 1.464988
SEMIO_RAT 798067.0 -1.729198e-16 1.000001 -1.750696 -0.540743 0.064233 0.669209 1.879161
SEMIO_KRIT 798067.0 5.967880e-17 1.000001 -2.016384 -0.880259 0.255867 0.823929 1.391992
SEMIO_DOM 798067.0 -2.576792e-16 1.000001 -1.946346 -0.851143 0.244059 0.791661 1.339262
SEMIO_KAEM 798067.0 2.900694e-17 1.000001 -1.763875 -0.693002 -0.157565 0.913309 1.448745
SEMIO_PFLICHT 798067.0 -1.856694e-16 1.000001 -1.717763 -0.638942 -0.099532 0.979289 1.518699
SEMIO_TRADV 798067.0 -1.355082e-16 1.000001 -1.543009 -0.976609 0.156189 0.722588 1.855386
SOHO_KZ 798067.0 -1.736143e-18 1.000001 -0.092130 -0.092130 -0.092130 -0.092130 10.854170
VERS_TYP 798067.0 -7.072779e-17 1.000001 -1.083494 -1.083494 0.922940 0.922940 0.922940
ANZ_PERSONEN 798067.0 1.207287e-16 1.000001 -1.494855 -0.630198 -0.630198 0.234458 37.414694
ANZ_TITEL 798067.0 3.648571e-17 1.000001 -0.060408 -0.060408 -0.060408 -0.060408 87.038546
HH_EINKOMMEN_SCORE 798067.0 2.181842e-16 1.000001 -2.209012 -0.914719 0.379574 1.026720 1.026720
W_KEIT_KIND_HH 798067.0 8.087755e-17 1.000001 -1.855006 -0.730192 0.394622 0.957029 0.957029
WOHNDAUER_2008 798067.0 1.798466e-17 1.000001 -3.592502 0.047353 0.567332 0.567332 0.567332
ANZ_HAUSHALTE_AKTIV 798067.0 -3.774998e-17 1.000001 -0.466933 -0.466933 -0.274922 0.045097 37.551286
ANZ_HH_TITEL 798067.0 2.740435e-17 1.000001 -0.125133 -0.125133 -0.125133 -0.125133 71.025640
KONSUMNAEHE 798067.0 7.888321e-17 1.000001 -1.304533 -0.659689 -0.014845 0.630000 2.564533
MIN_GEBAEUDEJAHR 798067.0 -1.373679e-14 1.000001 -2.483678 -0.383167 -0.383167 -0.083094 6.818585
KBA05_ANTG1 798067.0 -1.070176e-16 1.000001 -1.008836 -1.008836 -0.297919 1.123914 1.834830
KBA05_ANTG2 798067.0 -3.984225e-19 1.000001 -0.965690 -0.965690 -0.162208 0.641274 2.248238
KBA05_ANTG3 798067.0 -7.874966e-17 1.000001 -0.594874 -0.594874 -0.594874 0.408133 2.414147
KBA05_ANTG4 798067.0 -5.667839e-17 1.000001 -0.464084 -0.464084 -0.464084 -0.464084 2.730674
KBA05_GBZ 798067.0 5.794266e-17 1.000001 -1.659274 -0.887733 -0.116192 0.655349 1.426889
BALLRAUM 798067.0 2.276573e-16 1.000001 -1.444672 -0.986687 0.387266 0.845251 1.303236
EWDICHTE 798067.0 1.134280e-16 1.000001 -1.710448 -1.128805 0.034482 1.197770 1.197770
INNENSTADT 798067.0 5.666503e-17 1.000001 -1.750229 -0.764133 0.221962 0.715010 1.701105
GEBAEUDETYP_RASTER 798067.0 -1.568672e-16 1.000001 -2.966157 -0.799742 0.283466 0.283466 1.366673
KKK 798067.0 -1.006785e-16 1.000001 -1.852961 -0.791629 0.269703 0.269703 1.331035
MOBI_REGIO 798067.0 1.470647e-16 1.000001 -1.279712 -1.279712 0.092849 0.779129 2.151690
ONLINE_AFFINITAET 798067.0 6.094752e-17 1.000001 -1.763373 -1.119927 0.166966 0.810412 1.453859
REGIOTYP 798067.0 8.563191e-17 1.000001 -1.988073 -0.882469 0.223134 0.775936 1.328738
KBA13_ANZAHL_PKW 798067.0 1.169893e-17 1.000001 -1.800133 -0.700801 -0.222335 0.461188 4.750292
PLZ8_ANTG1 798067.0 -1.589238e-16 1.000001 -2.342845 -0.256528 -0.256528 0.786631 1.829790
PLZ8_ANTG2 798067.0 -2.472980e-16 1.000001 -3.094337 -0.890125 0.211982 0.211982 1.314088
PLZ8_ANTG3 798067.0 1.575884e-18 1.000001 -1.649199 -0.623111 0.402977 0.402977 1.429065
PLZ8_ANTG4 798067.0 4.257557e-17 1.000001 -0.934791 -0.934791 0.442535 0.442535 1.819861
PLZ8_HHZ 798067.0 -9.509611e-17 1.000001 -2.688581 -0.616363 -0.616363 0.419746 1.455855
PLZ8_GBZ 798067.0 -1.752703e-16 1.000001 -2.160199 -0.337194 -0.337194 0.574309 1.485812
ARBEIT 798067.0 1.365766e-17 1.000001 -2.174899 -0.171399 -0.171399 0.830351 1.832100
ORTSGR_KLS9 798067.0 6.716647e-17 1.000001 -1.868058 -0.562296 -0.127042 0.743466 1.613974
RELAT_AB 798067.0 -1.462812e-17 1.000001 -1.525981 -0.789025 -0.052070 0.684885 1.421840
OST_WEST_KZ_O 798067.0 1.602593e-18 1.000001 -0.517425 -0.517425 -0.517425 -0.517425 1.932646
PRAEGENDE_JUGENDJAHRE_DECADE 798067.0 9.853279e-17 1.000001 -3.572121 -0.539676 -0.034268 0.976547 0.976547
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE 798067.0 -1.104899e-17 1.000001 -0.530407 -0.530407 -0.530407 -0.530407 1.885345
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM 798067.0 4.075929e-17 1.000001 -1.706581 -1.706581 0.585967 0.585967 0.585967
CAMEO_INTL_2015_WEALTH 798067.0 5.445256e-17 1.000001 -1.551461 -0.869682 0.493876 1.175655 1.175655
CAMEO_INTL_2015_LIFE_STAGE_TYP 798067.0 -3.221213e-17 1.000001 -1.248888 -1.248888 0.095102 0.767097 1.439092
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS 798067.0 -1.727240e-17 1.000001 -0.372695 -0.372695 -0.372695 -0.372695 2.683160
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY 798067.0 8.582778e-18 1.000001 -2.369033 0.422113 0.422113 0.422113 0.422113
PLZ8_BAUMAX_FAMILY_HOMES 798067.0 3.340517e-17 1.000001 -1.300144 -0.304992 -0.304992 -0.304992 2.680462
WOHNLAGE_RURAL_FLAG 798067.0 -3.703771e-18 1.000001 -0.552785 -0.552785 -0.552785 -0.552785 1.809022
WOHNLAGE_CITY_NEIGHBOURHOOD 798067.0 1.278781e-16 1.000001 -1.454802 -0.846757 0.369332 0.977376 1.585421

Discussion 2.1: Apply Feature Scaling

(Double-click this cell and replace this text with your own text, reporting your decisions regarding feature scaling.)

Imputed columns with most frequent value, because a lot of them they have a more or less prominent most frequent value (with a high frequency relative to other levels, without a close number two). I appears that mode is a goood option to impute. Many columns have a good pronounced mode, with no close 2nd most freq value. Not all columns have that. You can argue that maybe median may be better for some columns (e.g., KBA13_ANZAHL_PKW), but just to keep things simple I'll use mode.

Scaled data with StandardScaler.

Step 2.2: Perform Dimensionality Reduction

On your scaled data, you are now ready to apply dimensionality reduction techniques.

  • Use sklearn's PCA class to apply principal component analysis on the data, thus finding the vectors of maximal variance in the data. To start, you should not set any parameters (so all components are computed) or set a number of components that is at least half the number of features (so there's enough features to see the general trend in variability).
  • Check out the ratio of variance explained by each principal component as well as the cumulative variance explained. Try plotting the cumulative or sequential values using matplotlib's plot() function. Based on what you find, select a value for the number of transformed features you'll retain for the clustering part of the project.
  • Once you've made a choice for the number of components to keep, make sure you re-fit a PCA instance to perform the decided-on transformation.
In [198]:
# Apply PCA to the data.

def do_pca(data_scaled,n_components=None):
    '''
    
    Source: Udacity Nanodegree Unsupervised Learning 4 Dimensionality Reduction and PCA 
    Assumes data is already scaled.

    INPUT: n_components - int - the number of principal components to create
           data - the data you would like to transform

    OUTPUT: pca - the pca object created after fitting the data
            X_pca - the transformed X matrix with new number of components
    '''
    pca = PCA(n_components,random_state=42)
    X_pca = pca.fit_transform(data_scaled)
    return pca, X_pca
In [199]:
pca, X_pca = do_pca(azdias_scaled)
In [200]:
print(azdias_scaled.shape)
print(X_pca.shape)
(798067, 70)
(798067, 70)
In [201]:
type(X_pca)
Out[201]:
numpy.ndarray
In [202]:
X_pca
Out[202]:
array([[  3.40924403e+00,  -2.98838568e+00,  -3.12516794e+00, ...,
         -1.36618198e-01,  -6.44366539e-02,  -9.46553317e-15],
       [ -1.20625506e+00,   6.41781896e-01,  -2.70946954e+00, ...,
          9.19622464e-02,   1.52014813e-01,  -1.12218904e-14],
       [ -4.02314639e+00,   1.73019209e+00,  -7.35136675e-01, ...,
         -6.63697460e-01,   1.98517770e-01,   4.04347317e-15],
       ..., 
       [ -2.01092829e+00,  -3.15269527e+00,  -3.15103351e+00, ...,
         -1.23762901e-01,   9.82314928e-02,  -6.75218026e-17],
       [  6.13439976e+00,  -3.98322369e+00,   2.22490845e+00, ...,
          1.14138153e-01,  -5.23150612e-02,   2.15489491e-17],
       [ -3.75399180e-01,   8.63149953e-01,   2.85234550e+00, ...,
         -1.98221526e-01,  -1.72061349e-01,   3.30086049e-16]])
In [203]:
pca
Out[203]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)
In [204]:
# np.arange(0,1.1, 0.1)
In [205]:
# turn off scientific notation
# Credit:
# https://twitter.com/vboykis
pd.set_option('display.float_format', lambda x: '%.3f' % x)
In [206]:
# look at each PCA and how much variance it explains
num_components=len(pca.explained_variance_ratio_)
ind = np.arange(num_components)+1
vals = pca.explained_variance_ratio_
cumvals = np.cumsum(vals)
test_pca_df = pd.DataFrame({'pca': ind, 'vals': vals, 'cumvals': cumvals})
test_pca_df
Out[206]:
pca vals cumvals
0 1 0.166 0.166
1 2 0.121 0.287
2 3 0.087 0.374
3 4 0.059 0.433
4 5 0.037 0.470
5 6 0.034 0.505
6 7 0.027 0.531
7 8 0.025 0.557
8 9 0.024 0.581
9 10 0.022 0.603
10 11 0.021 0.623
11 12 0.020 0.643
12 13 0.019 0.662
13 14 0.018 0.680
14 15 0.017 0.697
15 16 0.016 0.713
16 17 0.016 0.728
17 18 0.014 0.742
18 19 0.013 0.756
19 20 0.013 0.768
20 21 0.012 0.781
21 22 0.012 0.793
22 23 0.012 0.805
23 24 0.011 0.815
24 25 0.011 0.826
25 26 0.010 0.836
26 27 0.010 0.846
27 28 0.009 0.855
28 29 0.007 0.862
29 30 0.007 0.869
30 31 0.007 0.876
31 32 0.007 0.883
32 33 0.006 0.889
33 34 0.006 0.895
34 35 0.006 0.901
35 36 0.006 0.907
36 37 0.006 0.912
37 38 0.005 0.918
38 39 0.005 0.923
39 40 0.005 0.928
40 41 0.004 0.932
41 42 0.004 0.936
42 43 0.004 0.941
43 44 0.004 0.944
44 45 0.004 0.948
45 46 0.004 0.952
46 47 0.004 0.955
47 48 0.003 0.959
48 49 0.003 0.962
49 50 0.003 0.965
50 51 0.003 0.968
51 52 0.003 0.971
52 53 0.003 0.974
53 54 0.002 0.976
54 55 0.002 0.979
55 56 0.002 0.981
56 57 0.002 0.983
57 58 0.002 0.985
58 59 0.002 0.987
59 60 0.002 0.989
60 61 0.002 0.991
61 62 0.002 0.992
62 63 0.002 0.994
63 64 0.001 0.996
64 65 0.001 0.997
65 66 0.001 0.998
66 67 0.001 0.999
67 68 0.001 1.000
68 69 0.000 1.000
69 70 0.000 1.000
In [207]:
def look_at_pca_and_variance(x):
    y = min(np.where(np.cumsum(pca.explained_variance_ratio_)>=x)[0]+1)
    print('To explain', x, 'of variance, we need', y, 'PCAs')
    return y
look_at_pca_and_variance(0.6)    
To explain 0.6 of variance, we need 10 PCAs
Out[207]:
10
In [208]:
look_at_pca_and_variance(0.7)  
To explain 0.7 of variance, we need 16 PCAs
Out[208]:
16
In [209]:
look_at_pca_and_variance(0.8)  
To explain 0.8 of variance, we need 23 PCAs
Out[209]:
23

- Starting from PCA 27, they explain less than 1% each. They are noise

- We need 23 PCAs to explain 80% of variance

- PCA reduces many dimensions to fewer dimensions. It is a linear projection

- It's done so k means runs faster and so we have less noise (our model will generalize better)

- If there's non-linearity, it won't be captured by K means

- tSNE could be used as an alternative to capture non-linearity

In [210]:
# num_components=len(pca.explained_variance_ratio_)
# ind = np.arange(num_components)+1
# ind
In [211]:
# np.arange(0,71, 5)
In [212]:
# Investigate the variance accounted for by each principal component.

def scree_plot(pca):
    '''
    Source: Udacity Nanodegree Unsupervised Learning 4 Dimensionality Reduction and PCA 
    https://stackoverflow.com/questions/12608788/changing-the-tick-frequency-on-x-or-y-axis-in-matplotlib
    
    Creates a scree plot associated with the principal components 
    
    INPUT: pca - the result of instantian of PCA in scikit learn
            
    OUTPUT:
            None
    '''
    num_components=len(pca.explained_variance_ratio_)
    ind = np.arange(num_components)+1
    vals = pca.explained_variance_ratio_
 
    plt.figure(figsize=(15,10))
    ax = plt.subplot(111)
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals)
    ax.plot(ind, cumvals)
    for i in range(num_components):
        ax.annotate(r"%s%%" % ((str(vals[i]*100)[:4])), (ind[i]+0.2, vals[i]), va="bottom", ha="center", fontsize=12)
 
    ax.xaxis.set_tick_params(width=0)
    ax.yaxis.set_tick_params(width=2, length=12)

    plt.yticks(np.arange(0,1.1, 0.1)) # I added this to see every 10% on the y axis
    plt.xticks(np.arange(0,71, 5))
    
    # draw lines to understand how many PCAs I would need
    X = 0.6
    Y = look_at_pca_and_variance(X)
    plt.hlines(y=X, xmin=0, xmax=Y, color='red', linestyles='dashed',zorder=1)
    plt.vlines(x=Y, ymin=0, ymax=X, color='red', linestyles='dashed',zorder=2)
    
    X = 0.7
    Y = look_at_pca_and_variance(X)
    plt.hlines(y=X, xmin=0, xmax=Y, color='red', linestyles='dashed',zorder=3)
    plt.vlines(x=Y, ymin=0, ymax=X, color='red', linestyles='dashed',zorder=4)
    
    X = 0.8
    Y = look_at_pca_and_variance(X)
    plt.hlines(y=X, xmin=0, xmax=Y, color='red', linestyles='dashed',zorder=3)
    plt.vlines(x=Y, ymin=0, ymax=X, color='red', linestyles='dashed',zorder=4)


    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained (%)")
    plt.title('Explained Variance Per Principal Component')

scree_plot(pca)    
To explain 0.6 of variance, we need 10 PCAs
To explain 0.7 of variance, we need 16 PCAs
To explain 0.8 of variance, we need 23 PCAs
In [213]:
# Percentage of variance explained 
pca.explained_variance_ratio_.sum()
Out[213]:
1.0000000000000002
In [214]:
# Re-apply PCA to the data while selecting for number of components to retain.

pca, X_pca = do_pca(azdias_scaled,n_components=23)
print(azdias_scaled.shape)
print(X_pca.shape)
scree_plot(pca)  
(798067, 70)
(798067, 23)
To explain 0.6 of variance, we need 10 PCAs
To explain 0.7 of variance, we need 16 PCAs
To explain 0.8 of variance, we need 23 PCAs
In [215]:
# Percentage of variance explained 
pca.explained_variance_ratio_.sum()
Out[215]:
0.80447380248239619
In [216]:
type(azdias_scaled)
Out[216]:
pandas.core.frame.DataFrame

Discussion 2.2: Perform Dimensionality Reduction

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding dimensionality reduction. How many principal components / transformed features are you retaining for the next step of the analysis?)

Step 2.3: Interpret Principal Components

Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.

As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component is in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one tend expect to be associated with increases in the other. To contrast, features with different signs can be expected to show a negative correlation: increases in one variable should result in a decrease in the other.

  • To investigate the features, you should map each weight to their corresponding feature name, then sort the features according to weight. The most interesting features for each principal component, then, will be those at the beginning and end of the sorted list. Use the data dictionary document to help you understand these most prominent features, their relationships, and what a positive or negative value on the principal component might indicate.
  • You should investigate and interpret feature associations from the first three principal components in this substep. To help facilitate this, you should write a function that you can call at any time to print the sorted list of feature weights, for the i-th principal component. This might come in handy in the next step of the project, when you interpret the tendencies of the discovered clusters.
In [217]:
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
# HINT: Try defining a function here or in a new cell that you can reuse in the
# other cells.

def pca_results(full_dataset, pca,number):
	'''
    Source: Udacity Nanodegree Unsupervised Learning 4 Dimensionality Reduction and PCA 
    
    https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
    
    https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html
    
	Create a DataFrame of the PCA results
	Includes dimension feature weights and explained variance
	Visualizes the PCA results
	'''
	number_minus_1 = number-1

	# Dimension indexing
	dimensions = dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]

	# PCA components
	components = pd.DataFrame(np.round(pca.components_, 4), columns = full_dataset.keys())
	components.index = dimensions

	# PCA explained variance
	ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
	variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
	variance_ratios.index = dimensions

	# Create a bar plot visualization
	fig, ax = plt.subplots(figsize = (10,15))

	# Plot the feature weights as a function of the components
	components.iloc[[number_minus_1]].plot(ax = ax, kind = 'bar', width =3);
	ax.set_ylabel("Feature Weights")
	#create ticks for X axis
	dimension_of_interest = []    
	dimension_of_interest.append(dimensions[number_minus_1])
	ax.set_xticklabels(dimension_of_interest, rotation=0) 
    
	ax.set_xlabel("Dataset Variables")
	# title
	title = "Feature Weights for Dimension " + str(number)
	ax.set_title(title)

	# Display the explained variance ratios
	ax.text(-0.1, ax.get_ylim()[1] + 0.05,s="Explained Variance\n          %.4f"%(pca.explained_variance_ratio_[number_minus_1]))
    
	ax.legend(bbox_to_anchor=(1.1, 1.05)) # move legend outta the graph
    
	return pd.concat([variance_ratios, components], axis = 1)
In [218]:
my_pca_results = pca_results(azdias_scaled, pca, 1)
In [219]:
# Map weights for the second principal component to corresponding feature names
# and then print the linked values, sorted by weight.

my_pca_results = pca_results(azdias_scaled, pca, 2)
In [220]:
# Map weights for the third principal component to corresponding feature names
# and then print the linked values, sorted by weight.

my_pca_results = pca_results(azdias_scaled, pca, 3)
In [221]:
def plot_pca(data, pca, n_compo):
    '''
	The above visualizations are too large to understand. Vis below only shows top vars.
    Adapted from: 
    https://github.com/MiguelAMartinez/identify-customer-segments-Arvato/blob/master/Identify_Customer_Segments.ipynb
    Plot the features with the most absolute variance for given pca component.
	'''
    
    compo = pd.DataFrame(pca.components_, columns = data.keys()).iloc[n_compo-1]
    compo.sort_values(ascending=False, inplace=True)
    compo = pd.concat([compo.head(7), compo.tail(7)])
    
    fig, ax = plt.subplots(figsize = (10,7))
    
    compo.plot(kind='barh', title='Component ' + str(n_compo))
    
    ax = plt.gca()
    ax.grid(linewidth='0.5', alpha=0.5)
    ax.set_axisbelow(True)
    plt.show()
In [222]:
plot_pca(azdias_scaled, pca,1) 
In [223]:
# I wasn't sure if i should supply a scaled dataset or the non-scated one. It doesn't matter cus the results will be equal
# my_pca_results_1 = pca_results(azdias_cleaned_encoded_imputed, pca, 1)
# my_pca_results_2 = pca_results(azdias_scaled, pca, 1)
# my_pca_results_1.equals(my_pca_results_2)
# True
In [224]:
plot_pca(azdias_scaled, pca,2) 
In [225]:
plot_pca(azdias_scaled, pca,3) 
In [226]:
my_pca_results.head()
# temp_df = my_pca_results.head()
# temp_df.to_excel("feature_weights.xlsx")
Out[226]:
Explained Variance ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
Dimension 1 0.166 -0.082 0.011 -0.194 0.108 -0.079 0.035 0.043 0.161 -0.101 0.030 0.004 0.033 0.057 0.078 0.056 -0.039 -0.048 -0.049 0.047 0.066 0.013 0.028 0.046 0.081 0.056 -0.002 0.034 -0.079 -0.003 0.193 0.049 -0.049 0.136 0.036 -0.170 -0.046 -0.208 0.011 0.132 0.154 -0.212 -0.131 0.203 -0.170 -0.120 0.037 -0.220 -0.059 0.058 -0.076 -0.227 0.154 0.227 0.220 0.043 -0.169 0.146 0.207 0.138 0.053 0.030 -0.101 0.086 0.201 -0.118 0.150 -0.136 0.070 -0.155 0.173
Dimension 2 0.121 0.277 0.098 0.095 -0.251 0.244 -0.208 -0.230 0.091 0.007 -0.060 0.165 -0.117 -0.201 -0.271 -0.177 -0.073 0.196 0.247 -0.237 -0.180 0.122 0.077 0.116 -0.242 -0.245 -0.002 0.030 -0.048 0.008 0.021 0.105 0.069 0.041 0.026 -0.045 -0.039 -0.033 -0.005 0.022 0.046 -0.051 -0.043 0.065 -0.051 -0.027 -0.009 -0.038 -0.155 -0.001 -0.023 -0.054 0.040 0.057 0.056 0.010 -0.043 0.045 0.064 0.045 0.014 -0.211 0.007 -0.022 0.039 0.022 0.032 -0.023 0.022 -0.052 0.042
Dimension 3 0.086 0.091 -0.363 0.169 -0.116 0.109 -0.201 -0.101 -0.052 0.078 -0.035 0.114 0.256 0.239 0.058 0.047 0.343 0.075 -0.166 0.223 -0.225 -0.272 -0.307 -0.330 -0.089 -0.089 0.000 0.004 -0.003 0.013 -0.040 0.081 0.037 0.020 0.015 -0.040 -0.016 -0.006 -0.009 -0.001 0.023 -0.015 -0.042 0.054 -0.048 -0.033 -0.029 -0.013 -0.051 -0.019 -0.020 -0.041 0.028 0.042 0.043 0.009 -0.033 0.033 0.055 0.033 0.008 -0.074 0.078 -0.069 0.014 -0.001 0.041 -0.036 0.000 -0.041 0.036
Dimension 4 0.059 -0.053 0.064 0.043 0.021 -0.037 -0.070 0.093 -0.096 0.388 0.011 -0.018 -0.018 -0.035 0.004 0.008 -0.045 -0.006 -0.010 -0.041 0.069 0.031 0.125 0.092 0.034 0.062 0.002 0.017 0.094 0.041 -0.223 -0.089 0.007 -0.035 0.025 -0.110 -0.025 0.105 -0.000 -0.079 -0.036 0.066 -0.180 0.211 -0.170 -0.067 -0.216 0.075 0.128 -0.171 0.026 -0.017 0.064 0.059 0.050 0.098 0.032 0.044 0.212 0.077 -0.097 0.045 0.388 -0.365 -0.119 0.052 0.089 -0.082 -0.073 -0.207 0.112
Dimension 5 0.037 0.011 0.017 0.007 -0.002 -0.035 0.008 0.001 -0.013 -0.027 -0.027 -0.016 -0.016 -0.003 -0.021 0.021 0.001 0.023 0.004 -0.051 0.011 0.020 -0.024 -0.021 -0.031 0.017 -0.001 -0.050 -0.019 0.032 -0.065 0.001 -0.044 0.183 0.136 -0.055 0.124 -0.040 -0.221 -0.160 0.216 -0.054 0.003 -0.101 0.021 -0.228 -0.170 -0.078 0.020 -0.125 -0.028 -0.014 -0.185 -0.065 0.046 -0.061 -0.097 -0.165 -0.098 -0.177 0.097 -0.004 -0.027 0.018 -0.055 -0.028 0.390 -0.407 -0.393 0.185 -0.158
In [227]:
my_pca_results
Out[227]:
Explained Variance ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
Dimension 1 0.166 -0.082 0.011 -0.194 0.108 -0.079 0.035 0.043 0.161 -0.101 0.030 0.004 0.033 0.057 0.078 0.056 -0.039 -0.048 -0.049 0.047 0.066 0.013 0.028 0.046 0.081 0.056 -0.002 0.034 -0.079 -0.003 0.193 0.049 -0.049 0.136 0.036 -0.170 -0.046 -0.208 0.011 0.132 0.154 -0.212 -0.131 0.203 -0.170 -0.120 0.037 -0.220 -0.059 0.058 -0.076 -0.227 0.154 0.227 0.220 0.043 -0.169 0.146 0.207 0.138 0.053 0.030 -0.101 0.086 0.201 -0.118 0.150 -0.136 0.070 -0.155 0.173
Dimension 2 0.121 0.277 0.098 0.095 -0.251 0.244 -0.208 -0.230 0.091 0.007 -0.060 0.165 -0.117 -0.201 -0.271 -0.177 -0.073 0.196 0.247 -0.237 -0.180 0.122 0.077 0.116 -0.242 -0.245 -0.002 0.030 -0.048 0.008 0.021 0.105 0.069 0.041 0.026 -0.045 -0.039 -0.033 -0.005 0.022 0.046 -0.051 -0.043 0.065 -0.051 -0.027 -0.009 -0.038 -0.155 -0.001 -0.023 -0.054 0.040 0.057 0.056 0.010 -0.043 0.045 0.064 0.045 0.014 -0.211 0.007 -0.022 0.039 0.022 0.032 -0.023 0.022 -0.052 0.042
Dimension 3 0.086 0.091 -0.363 0.169 -0.116 0.109 -0.201 -0.101 -0.052 0.078 -0.035 0.114 0.256 0.239 0.058 0.047 0.343 0.075 -0.166 0.223 -0.225 -0.272 -0.307 -0.330 -0.089 -0.089 0.000 0.004 -0.003 0.013 -0.040 0.081 0.037 0.020 0.015 -0.040 -0.016 -0.006 -0.009 -0.001 0.023 -0.015 -0.042 0.054 -0.048 -0.033 -0.029 -0.013 -0.051 -0.019 -0.020 -0.041 0.028 0.042 0.043 0.009 -0.033 0.033 0.055 0.033 0.008 -0.074 0.078 -0.069 0.014 -0.001 0.041 -0.036 0.000 -0.041 0.036
Dimension 4 0.059 -0.053 0.064 0.043 0.021 -0.037 -0.070 0.093 -0.096 0.388 0.011 -0.018 -0.018 -0.035 0.004 0.008 -0.045 -0.006 -0.010 -0.041 0.069 0.031 0.125 0.092 0.034 0.062 0.002 0.017 0.094 0.041 -0.223 -0.089 0.007 -0.035 0.025 -0.110 -0.025 0.105 -0.000 -0.079 -0.036 0.066 -0.180 0.211 -0.170 -0.067 -0.216 0.075 0.128 -0.171 0.026 -0.017 0.064 0.059 0.050 0.098 0.032 0.044 0.212 0.077 -0.097 0.045 0.388 -0.365 -0.119 0.052 0.089 -0.082 -0.073 -0.207 0.112
Dimension 5 0.037 0.011 0.017 0.007 -0.002 -0.035 0.008 0.001 -0.013 -0.027 -0.027 -0.016 -0.016 -0.003 -0.021 0.021 0.001 0.023 0.004 -0.051 0.011 0.020 -0.024 -0.021 -0.031 0.017 -0.001 -0.050 -0.019 0.032 -0.065 0.001 -0.044 0.183 0.136 -0.055 0.124 -0.040 -0.221 -0.160 0.216 -0.054 0.003 -0.101 0.021 -0.228 -0.170 -0.078 0.020 -0.125 -0.028 -0.014 -0.185 -0.065 0.046 -0.061 -0.097 -0.165 -0.098 -0.177 0.097 -0.004 -0.027 0.018 -0.055 -0.028 0.390 -0.407 -0.393 0.185 -0.158
Dimension 6 0.034 0.009 0.004 0.096 -0.013 -0.059 0.047 -0.059 -0.099 0.032 -0.006 -0.046 0.002 -0.044 -0.031 -0.065 0.012 -0.034 0.030 0.013 -0.017 -0.017 0.005 0.030 -0.017 -0.027 0.003 -0.021 0.145 -0.011 -0.052 -0.144 0.041 -0.044 -0.044 0.140 0.063 0.082 -0.113 -0.065 -0.021 0.095 -0.012 0.018 -0.053 0.075 0.179 0.093 0.124 0.130 -0.488 -0.026 -0.108 -0.033 0.013 -0.477 -0.391 0.213 0.072 0.138 0.220 0.076 0.032 0.002 -0.003 0.016 0.049 -0.043 0.043 -0.032 0.052
Dimension 7 0.027 0.043 -0.042 0.090 0.008 0.004 0.024 0.039 -0.232 -0.046 0.075 0.005 0.004 -0.035 -0.067 -0.143 0.056 -0.062 0.050 0.015 -0.085 0.012 -0.016 0.040 -0.042 -0.106 0.006 0.053 0.223 -0.027 -0.077 -0.244 0.004 0.105 0.004 0.042 0.236 -0.036 -0.279 -0.107 0.181 -0.102 -0.081 0.068 -0.062 0.048 0.344 -0.107 0.191 0.363 0.248 0.050 -0.095 -0.011 0.029 0.229 0.147 -0.037 0.092 -0.031 -0.115 0.098 -0.046 0.072 -0.018 -0.058 -0.019 -0.034 -0.012 -0.120 0.148
Dimension 8 0.025 0.054 0.031 -0.039 0.001 -0.055 0.022 -0.012 0.035 0.118 -0.225 -0.082 0.003 -0.030 -0.082 0.088 0.058 -0.005 0.027 -0.081 -0.012 -0.072 -0.032 -0.016 -0.110 0.014 0.005 -0.221 0.175 0.051 -0.048 -0.200 0.001 0.108 0.083 0.052 0.187 -0.207 -0.169 0.113 0.136 -0.192 0.080 -0.086 0.074 0.059 -0.186 -0.206 0.154 -0.215 0.060 -0.128 0.054 0.130 0.104 0.014 -0.076 -0.013 -0.091 -0.058 0.201 0.066 0.118 -0.086 0.021 0.083 -0.222 0.172 0.352 0.112 -0.162
Dimension 9 0.024 -0.083 0.013 -0.076 0.016 0.036 0.032 -0.037 0.138 0.054 0.105 0.050 0.049 0.029 0.112 -0.055 -0.066 0.011 -0.044 0.056 0.050 -0.011 0.024 0.014 0.112 0.045 -0.000 0.149 -0.071 0.083 0.022 0.147 0.107 0.361 0.256 0.103 -0.259 0.101 -0.249 -0.177 0.355 -0.110 -0.047 0.022 -0.022 0.028 0.021 0.024 -0.141 -0.019 -0.071 0.167 -0.244 -0.163 0.005 -0.046 -0.003 0.002 0.041 -0.025 0.000 -0.132 0.054 -0.113 -0.048 0.140 -0.178 0.251 0.126 -0.042 0.045
Dimension 10 0.022 0.039 -0.069 -0.105 0.144 -0.098 0.144 0.032 0.071 0.001 0.281 -0.136 0.114 -0.035 -0.104 -0.325 -0.006 0.038 0.016 -0.040 -0.216 -0.067 0.001 -0.076 -0.108 -0.188 0.006 0.381 0.262 0.036 0.090 -0.230 0.090 0.033 0.056 -0.088 -0.336 0.064 0.137 0.066 0.001 0.009 0.068 -0.051 0.064 -0.096 -0.076 0.097 0.151 -0.155 -0.017 -0.053 0.051 0.044 0.086 0.053 -0.012 -0.052 -0.065 -0.029 0.039 0.133 0.001 0.020 0.063 -0.050 0.068 0.028 0.010 0.097 -0.063
Dimension 11 0.021 -0.041 0.048 0.088 -0.159 0.119 -0.120 -0.239 0.067 -0.175 -0.204 0.030 -0.012 -0.004 0.043 0.174 -0.017 0.011 -0.009 0.057 0.153 -0.112 0.057 0.051 0.063 0.162 0.013 -0.252 0.341 0.017 0.041 -0.275 0.292 0.046 0.056 -0.058 -0.301 0.144 0.057 -0.057 0.017 0.071 -0.063 0.048 -0.059 -0.049 -0.043 0.159 0.076 -0.081 0.051 0.004 -0.038 -0.009 0.059 0.153 0.086 0.085 0.056 0.035 0.087 0.046 -0.175 0.244 -0.007 0.051 0.082 0.016 -0.062 -0.015 0.040
Dimension 12 0.020 0.077 0.043 -0.135 0.112 -0.150 0.225 0.045 0.212 0.132 -0.297 -0.222 0.011 -0.032 -0.011 -0.047 0.015 0.170 0.069 -0.016 -0.071 -0.109 -0.126 -0.027 -0.051 -0.079 -0.001 -0.183 -0.054 0.006 0.054 0.157 -0.018 -0.003 -0.014 0.038 -0.099 0.090 -0.012 -0.040 -0.000 0.069 0.154 -0.088 0.096 0.027 0.201 0.095 -0.013 0.160 0.129 0.002 -0.112 -0.014 0.018 0.213 0.154 0.294 0.007 0.130 0.322 -0.025 0.132 -0.159 0.063 0.017 0.114 -0.074 -0.137 -0.042 0.088
Dimension 13 0.019 0.067 0.050 -0.099 0.058 0.074 0.048 0.141 0.054 0.119 -0.249 -0.014 -0.033 0.007 0.053 0.081 -0.016 0.178 0.072 0.007 0.030 -0.079 -0.085 0.013 0.057 -0.029 0.015 -0.168 0.206 0.114 0.072 -0.057 0.158 0.053 0.093 -0.075 -0.138 -0.028 0.142 0.031 0.006 -0.066 0.042 -0.022 0.069 -0.070 0.190 -0.042 0.001 0.243 -0.129 0.033 0.118 -0.022 -0.027 -0.183 -0.117 -0.329 -0.083 -0.125 -0.427 -0.113 0.119 -0.180 0.119 -0.220 0.018 0.001 -0.024 0.037 0.028
Dimension 14 0.018 0.089 -0.030 -0.115 0.079 -0.071 -0.056 0.160 0.020 -0.122 -0.274 -0.205 0.027 -0.053 -0.151 -0.069 0.091 0.012 0.020 -0.070 -0.141 -0.050 -0.064 -0.031 -0.175 -0.076 0.002 -0.167 -0.135 0.020 0.068 0.015 -0.323 0.102 0.174 -0.006 -0.079 0.018 0.099 -0.020 -0.022 -0.008 -0.214 0.133 -0.126 -0.010 -0.125 -0.004 0.145 -0.106 -0.087 0.168 -0.066 -0.144 -0.131 -0.110 0.024 -0.049 0.136 0.013 -0.201 0.250 -0.122 0.221 -0.116 0.145 -0.078 0.086 -0.060 -0.137 0.083
Dimension 15 0.017 -0.005 -0.017 0.086 -0.038 -0.005 -0.011 -0.029 -0.114 -0.023 0.095 0.022 -0.044 0.012 -0.008 0.055 0.008 -0.049 -0.001 0.012 0.017 0.058 0.036 -0.001 0.003 0.008 0.003 0.028 -0.024 0.625 -0.013 -0.017 -0.082 0.100 0.585 -0.023 0.154 -0.003 0.140 -0.053 -0.156 0.113 0.146 -0.048 0.103 -0.021 0.048 0.025 -0.005 0.057 0.031 -0.063 0.121 0.043 -0.045 0.021 0.026 0.137 -0.030 0.163 0.029 0.009 -0.023 0.031 0.039 -0.009 -0.010 -0.040 0.005 -0.021 0.024
Dimension 16 0.016 -0.006 -0.016 -0.104 0.017 0.059 -0.004 0.007 0.113 0.016 0.048 0.083 -0.009 0.030 0.019 0.023 0.013 -0.064 -0.018 -0.022 -0.004 -0.031 0.015 -0.023 -0.003 -0.010 0.030 0.093 0.181 0.053 0.092 -0.172 0.186 -0.041 -0.010 0.176 0.103 -0.272 0.090 0.341 -0.061 -0.214 0.082 -0.024 0.090 0.170 -0.083 -0.236 -0.043 -0.144 -0.036 0.208 -0.173 -0.190 -0.214 -0.083 0.116 0.144 0.028 0.145 -0.036 -0.110 0.016 -0.071 -0.037 0.115 0.054 -0.079 -0.309 -0.184 0.155
Dimension 17 0.015 -0.035 0.020 -0.072 0.020 -0.021 0.078 -0.003 0.159 -0.029 -0.028 0.052 0.041 -0.015 0.064 -0.070 -0.015 0.073 -0.011 0.019 0.009 -0.031 -0.074 0.032 0.047 -0.016 0.014 0.078 0.064 0.248 -0.091 0.085 0.075 -0.131 0.068 0.018 0.036 -0.049 0.058 0.272 -0.191 -0.003 -0.359 0.091 -0.290 0.057 0.141 0.011 0.046 0.188 0.051 -0.019 -0.053 0.041 0.024 0.013 -0.033 -0.134 0.066 -0.356 0.235 -0.103 -0.029 -0.020 -0.217 0.310 0.029 -0.029 0.016 0.134 -0.161
Dimension 18 0.014 -0.001 0.001 0.003 -0.001 0.001 -0.002 0.008 -0.000 -0.008 -0.002 -0.007 -0.000 -0.000 0.004 -0.001 -0.001 0.006 -0.001 -0.002 0.001 0.001 0.000 0.000 0.003 0.000 0.998 0.012 -0.013 -0.014 -0.007 0.017 -0.013 -0.001 -0.005 -0.007 0.007 0.009 -0.008 -0.021 0.005 0.008 0.008 -0.000 0.009 -0.002 -0.004 0.006 0.007 -0.002 0.001 -0.010 0.013 0.011 0.008 0.002 -0.004 0.003 -0.002 0.009 -0.005 -0.011 -0.008 0.002 -0.002 0.014 -0.001 -0.003 0.008 0.001 -0.004
Dimension 19 0.013 0.136 -0.025 -0.080 0.155 -0.134 0.109 0.198 -0.005 -0.181 0.030 0.160 -0.135 0.096 0.105 0.210 -0.004 0.402 0.098 -0.030 -0.018 -0.016 -0.071 -0.015 0.047 -0.092 -0.018 0.146 0.195 -0.117 -0.012 -0.115 -0.042 0.025 -0.004 -0.046 0.210 0.009 0.097 -0.311 0.031 0.089 -0.103 0.057 -0.106 -0.082 -0.125 0.005 0.068 -0.068 0.012 -0.005 0.070 -0.031 -0.112 -0.031 0.036 0.148 0.026 0.134 0.037 -0.352 -0.181 -0.052 -0.018 0.056 -0.103 0.035 0.038 0.037 -0.066
Dimension 20 0.013 0.113 -0.041 0.004 0.170 -0.279 0.089 0.177 -0.243 -0.035 0.052 0.052 0.014 0.001 -0.091 0.062 0.060 0.034 0.070 -0.026 -0.076 0.065 0.015 0.006 -0.109 -0.067 0.020 -0.183 -0.188 0.218 -0.010 0.001 0.559 -0.107 -0.059 0.010 -0.002 0.016 -0.102 0.024 0.018 0.005 -0.139 0.049 -0.128 0.026 -0.099 0.014 -0.315 -0.095 -0.012 0.040 -0.091 -0.044 0.004 0.022 0.033 0.018 0.045 -0.017 0.028 0.151 -0.035 0.059 0.041 -0.245 -0.029 0.056 0.024 0.026 -0.020
Dimension 21 0.012 0.052 -0.026 0.026 -0.032 -0.022 -0.003 -0.207 0.024 0.109 0.170 0.130 -0.057 0.025 -0.021 0.132 -0.009 0.004 0.041 0.025 0.013 0.003 0.043 0.002 -0.006 -0.010 0.030 -0.097 -0.012 -0.038 0.101 -0.117 -0.296 0.094 0.039 0.013 -0.029 -0.076 0.205 0.058 -0.076 0.031 -0.178 -0.007 -0.149 -0.107 0.025 -0.011 0.086 -0.010 0.028 0.118 -0.266 -0.162 -0.094 -0.011 0.067 0.011 0.044 -0.230 0.272 0.023 0.109 -0.089 0.226 -0.504 -0.092 0.108 -0.001 0.030 0.024
Dimension 22 0.012 -0.062 0.034 -0.006 -0.035 0.119 -0.085 0.015 0.041 0.025 -0.106 -0.052 0.022 -0.034 0.029 -0.124 -0.043 -0.072 -0.045 -0.016 0.017 -0.026 -0.089 0.027 0.046 0.014 0.001 0.068 0.035 0.146 0.027 0.029 -0.085 -0.062 -0.063 0.081 -0.073 0.006 -0.129 0.018 0.077 -0.056 -0.226 0.072 -0.164 0.221 -0.036 -0.030 0.081 0.037 0.011 0.031 0.025 -0.039 -0.073 0.035 0.082 0.218 0.005 0.382 -0.186 -0.017 0.025 -0.021 0.134 -0.242 0.013 -0.011 -0.087 0.396 -0.453
Dimension 23 0.012 0.005 0.006 -0.044 0.001 0.060 -0.008 0.006 0.124 -0.097 -0.073 0.011 -0.025 0.021 0.046 -0.017 -0.026 0.089 -0.017 -0.016 0.005 -0.046 -0.003 -0.027 0.017 0.007 -0.003 0.073 0.063 0.569 -0.008 0.014 -0.237 -0.271 -0.396 0.025 -0.025 0.082 -0.259 -0.052 0.113 0.018 0.091 0.003 0.044 0.042 -0.105 0.044 0.107 -0.154 -0.019 0.015 -0.094 0.009 0.088 0.006 -0.023 -0.098 0.017 -0.113 -0.011 -0.118 -0.097 0.037 0.020 -0.177 -0.003 0.017 0.048 -0.186 0.209

Discussion 2.3: Interpret Principal Components

(Double-click this cell and replace this text with your own text, reporting your observations from detailed investigation of the first few principal components generated. Can we interpret positive and negative values from them in a meaningful way?)

Dimension 1

This one is hard because it correlates with so many variables

Positively correlates with:

KBA05_ANTG3 high share of 6-10 family homes

KBA05_ANTG4h high share of 10+ family homes Do they live in a densely populated area?

CAMEO_INTL_2015_WEALTH. Positively correlated. High number means poor.

FINANZ_SPARER saver. High number means low affinity. Not saver.

FINANZ_HAUSBAUER home owner. POsitive correlation. High number means low affinity. Not a home owner.

Negative correlation:

MINIMALIST: low financial interest. Low number means high affinity. So they do have low financial interest, but i'm not sure what that means in this context

Observations:

Relatively poor people who don't save, don't own a home. Live an a densely populated area. Lower social status

Dimension 2

Positively correlates with

ALTERSKATEGORIE_GROB, age. Older people

LUST: sensual-minded. High number is lowest affinity.

ERL: event-oriented. High number is lowest affinity.

Not sensual-minded, not event-oriented

FINANZ_VORSORGER: be prepared. High number is lowest affinity. Not financially prepared.

Negative correlation:

SEMIO_REL: religious. Negative correlation. Low number means high affinity. Religious.

FINANZ_SPARER money saver. Negative correlation. Low number means high affinity.

FINANZ_ANLEGER investor. Negative correlation. Low number means high affinity. This contradicts to not financially prepared.

PRAEGENDE_JUGENDJAHRE_DECADE negatively correlates with decade, indicating an older decade. Makes sense. These are older people

PFLICHT: dutiful

TRADV: tradionally-minded. Low score means high affinity

Observations:

Older people, traditional, dutiful, religious, conservative, not event oriented, not sensually minded. Save and invest money but maybe still think they're not prepard for retirement (this is just an assumption)

Dimension 3

Positively correlates with :

SEMIO_VERT, which is dreamful personality

SOZ: socially-minded

FAM: family-minded

Small number indicates high affinity. Large number indicates small affinity. So it's a NEGATIVE correlation then.

They are not socially_minded, family-minded,

Negative correlation:

Feature 3 negatively correlates with gender ANREDE_KZ, 1 being male and 2 being female. So feature 3 must be associated with males.

negatively correlates with:

FINANZ_ANLEGER - investor

SEMIO_KAEM - combative attitude

SEMIO_KRIT: critical-minded

SEMIO_DOM: dominant-minded

however, the scale there is 1: highest affinity, 7: lowest affinity so the correlation is actually positive then.

Observations:

Male. Not dreamful, socially minded and family minded. An investor with a combative attitude, critical-minded, dominant.

Step 3: Clustering

Step 3.1: Apply Clustering to General Population

You've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, you will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.

  • Use sklearn's KMeans class to perform k-means clustering on the PCA-transformed data.
  • Then, compute the average difference from each point to its assigned cluster's center. Hint: The KMeans object's .score() method might be useful here, but note that in sklearn, scores tend to be defined so that larger is better. Try applying it to a small, toy dataset, or use an internet search to help your understanding.
  • Perform the above two steps for a number of different cluster counts. You can then see how the average distance decreases with an increasing number of clusters. However, each additional cluster provides a smaller net benefit. Use this fact to select a final number of clusters in which to group the data. Warning: because of the large size of the dataset, it can take a long time for the algorithm to resolve. The more clusters to fit, the longer the algorithm will take. You should test for cluster counts through at least 10 clusters to get the full picture, but you shouldn't need to test for a number of clusters above about 30.
  • Once you've selected a final number of clusters to use, re-fit a KMeans instance to perform the clustering operation. Make sure that you also obtain the cluster assignments for the general demographics data, since you'll be using them in the final Step 3.3.
In [228]:
# Use a sample to reduce computation time
# Borrowed from:
# https://github.com/MiguelAMartinez/identify-customer-segments-Arvato/blob/master/Identify_Customer_Segments.ipynb

# https://stackoverflow.com/questions/22994423/difference-between-np-random-seed-and-np-random-randomstate
np.random.seed(42)
# https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html
x = 0.25

X_pca_test_subset = X_pca[np.random.choice(X_pca.shape[0], int(X_pca.shape[0]*x), replace=False)]

print('Taking', x*100, 'percent of the data: ' , len(X_pca_test_subset), 'rows')

X_pca_test_subset
Taking 25.0 percent of the data:  199516 rows
Out[228]:
array([[ 1.35770809,  4.84583696, -0.88853132, ...,  0.31667212,
         0.12803458,  0.5365943 ],
       [ 2.15221546,  2.32769318, -1.78029203, ...,  0.59863523,
        -0.47901373, -0.50870354],
       [ 4.41319996,  3.21135748,  3.9577333 , ..., -0.04890789,
        -1.35555171, -0.45538185],
       ..., 
       [ 5.16483812,  5.06477375, -0.85087555, ...,  0.72772376,
        -0.34920901,  0.54752801],
       [ 0.25166879, -4.3784477 , -3.37640753, ..., -1.2435927 ,
        -0.63095716,  0.25152027],
       [ 5.96124379, -2.41743762, -2.9865717 , ...,  0.12979854,
         0.51238191,  0.41753436]])
In [229]:
X_pca
Out[229]:
array([[ 3.40924506, -2.98839272, -3.12515919, ..., -0.54596531,
         0.08406835,  1.18116573],
       [-1.20625655,  0.64178459, -2.70947703, ...,  0.96223871,
        -0.61896201, -1.30354837],
       [-4.02314636,  1.73019217, -0.73512714, ..., -0.9805615 ,
         1.06419267, -0.67000887],
       ..., 
       [-2.01092882, -3.15269442, -3.15102529, ...,  0.08738137,
         1.17673668, -0.69580493],
       [ 6.13439945, -3.98322046,  2.2249162 , ..., -1.07676273,
         0.60650368, -0.4018651 ],
       [-0.37539902,  0.8631471 ,  2.85234551, ...,  0.23989573,
        -0.16909422,  0.38225317]])
In [230]:
# len(X_pca) # 798067
In [231]:
# len(X_pca_test_subset)
In [232]:
# 159613/ 798067 # 0.1999994987889488
In [233]:
# 199516/ 798067 # 0.249999060229279
In [234]:
# Over a number of different cluster counts..
    # run k-means clustering on the data and...    
    # compute the average within-cluster distances.
"""https://stackoverflow.com/questions/48607546/k-means-cluster-method-score-negative
https://stackoverflow.com/questions/51138686/how-to-use-silhouette-score-in-k-means-clustering-from-sklearn-library
https://www.datacamp.com/community/tutorials/seaborn-python-tutorial
https://stackoverflow.com/questions/36220829/fine-control-over-the-font-size-in-seaborn-plots-for-academic-papers/36222162
https://stackoverflow.com/questions/16424724/how-can-i-fix-a-memoryerror-when-executing-scikit-learns-silhouette-score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

Runtime calculation taken from:
https://github.com/MiguelAMartinez/identify-customer-segments-Arvato/blob/master/Identify_Customer_Segments.ipynb
"""
def choose_k(my_pca_data, k_min, k_max):
    
    start_time = time.time()
    
    with active_session():
        list_k = []
        list_score = []
        
        for k in range(k_min,k_max+1):

            # run k-means
            kmeans = KMeans(k, random_state=42)
            
            model = kmeans.fit(my_pca_data)
            
            # vaidate with score(?)
            score = model.score(my_pca_data) # I'm not sure about this one, if I'm doing it right.
            # Checked other guys' submmissions, seems right. But they use abs() to get the abs value
                        
            # print reports
            print("k:", k)
            print("score:", score)    
            print("--------------------------------")
            print()
          
            # append
            list_k.append(k)
            list_score.append(score)

    df_k = pd.DataFrame(
            {'k': list_k,
             'score': list_score
             })

    print(df_k)
    
    # Investigate the change in within-cluster distance across number of clusters.
    # HINT: Use matplotlib's plot function to visualize this relationship.
  
    plt.rcParams["figure.figsize"] = (30,10)
    ax = sns.barplot(x="k", y="score", data=df_k)
    ax.set_title("Number of clusters and score",fontsize=40)  
    ax.set_xlabel("Number of clusters",fontsize=30)
    ax.set_ylabel("score",fontsize=30)
    ax.tick_params(labelsize=25)
    # Show the plot
    plt.show()    
    
    print("--- Run time: %s mins ---" % np.round(((time.time() - start_time)/60),2))
In [235]:
# this took half hour to run.
# Taking 25.0 percent of the data:  199516 rows
# Do NOT rerun it again, if you decide to re-run the whole project :) 
choose_k(X_pca_test_subset,k_min=2,k_max=30)
k: 2
score: -9534527.16974
--------------------------------

k: 3
score: -8774898.55988
--------------------------------

k: 4
score: -8241957.10428
--------------------------------

k: 5
score: -7834534.54164
--------------------------------

k: 6
score: -7439194.10916
--------------------------------

k: 7
score: -7177500.38958
--------------------------------

k: 8
score: -6946275.59724
--------------------------------

k: 9
score: -6772419.52538
--------------------------------

k: 10
score: -6595858.39621
--------------------------------

k: 11
score: -6439090.29896
--------------------------------

k: 12
score: -6289897.42137
--------------------------------

k: 13
score: -6132782.31472
--------------------------------

k: 14
score: -6067388.36798
--------------------------------

k: 15
score: -5984729.48827
--------------------------------

k: 16
score: -5878415.35592
--------------------------------

k: 17
score: -5768672.69264
--------------------------------

k: 18
score: -5623220.74438
--------------------------------

k: 19
score: -5524190.60956
--------------------------------

k: 20
score: -5501596.0826
--------------------------------

k: 21
score: -5424206.73288
--------------------------------

k: 22
score: -5337477.04185
--------------------------------

k: 23
score: -5236914.29316
--------------------------------

k: 24
score: -5203761.74678
--------------------------------

k: 25
score: -5164968.72856
--------------------------------

k: 26
score: -5079868.55618
--------------------------------

k: 27
score: -5024902.791
--------------------------------

k: 28
score: -5007132.56362
--------------------------------

k: 29
score: -4972536.54877
--------------------------------

k: 30
score: -4939278.31406
--------------------------------

     k        score
0    2 -9534527.170
1    3 -8774898.560
2    4 -8241957.104
3    5 -7834534.542
4    6 -7439194.109
5    7 -7177500.390
6    8 -6946275.597
7    9 -6772419.525
8   10 -6595858.396
9   11 -6439090.299
10  12 -6289897.421
11  13 -6132782.315
12  14 -6067388.368
13  15 -5984729.488
14  16 -5878415.356
15  17 -5768672.693
16  18 -5623220.744
17  19 -5524190.610
18  20 -5501596.083
19  21 -5424206.733
20  22 -5337477.042
21  23 -5236914.293
22  24 -5203761.747
23  25 -5164968.729
24  26 -5079868.556
25  27 -5024902.791
26  28 -5007132.564
27  29 -4972536.549
28  30 -4939278.314
--- Run time: 24.62 mins ---
In [236]:
# this took 75-90 mins to run
# my understanding is here is how k means run time increases: if you add 10 more rows, it increases by 100 units(??)
# if you increase rows by 10 times, run time will increase by 100 times(?)
# maybe they meant computation complexity
# no, it doesn't look like it here. I increased data x4 (took all data), run time increased by 3 times.
# Using all the data
# Definitely do NOT rerun it again, if you decide to re-run the whole project :) 
choose_k(X_pca,k_min=2,k_max=25)
k: 2
score: -37972652.8644
--------------------------------

k: 3
score: -34928060.341
--------------------------------

k: 4
score: -32807608.9863
--------------------------------

k: 5
score: -31177923.7077
--------------------------------

k: 6
score: -29619317.6603
--------------------------------

k: 7
score: -28685014.9498
--------------------------------

k: 8
score: -27674660.5294
--------------------------------

k: 9
score: -26911374.8197
--------------------------------

k: 10
score: -26240156.0953
--------------------------------

k: 11
score: -25612716.8858
--------------------------------

k: 12
score: -24878265.0983
--------------------------------

k: 13
score: -24430332.1836
--------------------------------

k: 14
score: -23932669.4846
--------------------------------

k: 15
score: -23776349.5293
--------------------------------

k: 16
score: -23284164.3846
--------------------------------

k: 17
score: -22778097.8015
--------------------------------

k: 18
score: -22704281.6858
--------------------------------

k: 19
score: -22008842.4776
--------------------------------

k: 20
score: -22008573.0937
--------------------------------

k: 21
score: -21626988.7403
--------------------------------

k: 22
score: -21213968.3012
--------------------------------

k: 23
score: -21022459.3969
--------------------------------

k: 24
score: -20851565.9562
--------------------------------

k: 25
score: -20671245.26
--------------------------------

     k         score
0    2 -37972652.864
1    3 -34928060.341
2    4 -32807608.986
3    5 -31177923.708
4    6 -29619317.660
5    7 -28685014.950
6    8 -27674660.529
7    9 -26911374.820
8   10 -26240156.095
9   11 -25612716.886
10  12 -24878265.098
11  13 -24430332.184
12  14 -23932669.485
13  15 -23776349.529
14  16 -23284164.385
15  17 -22778097.802
16  18 -22704281.686
17  19 -22008842.478
18  20 -22008573.094
19  21 -21626988.740
20  22 -21213968.301
21  23 -21022459.397
22  24 -20851565.956
23  25 -20671245.260
--- Run time: 75.23 mins ---
In [237]:
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.

kmeans_10 = KMeans(n_clusters = 10, random_state=42)

model_10 = kmeans_10.fit(X_pca)

clusters_general = model_10.predict(X_pca)
In [238]:
kmeans_10
Out[238]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=42, tol=0.0001, verbose=0)
In [239]:
model_10
Out[239]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=42, tol=0.0001, verbose=0)
In [240]:
clusters_general
Out[240]:
array([3, 7, 8, ..., 2, 9, 6], dtype=int32)
In [241]:
len(clusters_general)
Out[241]:
798067
In [242]:
# This number is the same as when i ran choose_k(X_pca,k_min=2,k_max=25) on k = 10
score_10 = model_10.score(X_pca)
score_10
Out[242]:
-26240156.095346279
In [243]:
clusters = pd.DataFrame ({'clusters_general' : clusters_general})
clusters
Out[243]:
clusters_general
0 3
1 7
2 8
3 6
4 2
5 2
6 0
7 7
8 7
9 5
10 4
11 2
12 5
13 2
14 2
15 7
16 2
17 9
18 9
19 7
20 0
21 4
22 7
23 4
24 7
25 3
26 0
27 1
28 7
29 5
30 4
31 2
32 7
33 8
34 2
35 2
36 2
37 8
38 8
39 7
40 0
41 4
42 0
43 3
44 5
45 0
46 3
47 2
48 9
49 6
50 9
51 3
52 5
53 9
54 4
55 8
56 7
57 5
58 5
59 8
60 1
61 8
62 0
63 9
64 2
65 3
66 6
67 8
68 8
69 7
70 2
71 0
72 7
73 2
74 6
75 8
76 0
77 9
78 0
79 7
80 8
81 4
82 6
83 7
84 1
85 9
86 5
87 7
88 4
89 7
90 4
91 0
92 6
93 5
94 8
95 7
96 2
97 0
98 7
99 5
... ...
797967 5
797968 3
797969 7
797970 8
797971 5
797972 4
797973 5
797974 3
797975 0
797976 8
797977 1
797978 2
797979 4
797980 3
797981 8
797982 3
797983 7
797984 2
797985 0
797986 7
797987 6
797988 6
797989 1
797990 2
797991 1
797992 1
797993 1
797994 1
797995 0
797996 5
797997 6
797998 2
797999 1
798000 5
798001 8
798002 5
798003 6
798004 8
798005 1
798006 4
798007 2
798008 8
798009 7
798010 0
798011 7
798012 3
798013 5
798014 6
798015 4
798016 7
798017 7
798018 7
798019 4
798020 0
798021 2
798022 6
798023 4
798024 2
798025 7
798026 7
798027 5
798028 3
798029 3
798030 3
798031 7
798032 3
798033 0
798034 2
798035 7
798036 7
798037 8
798038 3
798039 6
798040 7
798041 5
798042 3
798043 7
798044 7
798045 9
798046 3
798047 6
798048 1
798049 5
798050 4
798051 6
798052 2
798053 7
798054 4
798055 3
798056 6
798057 6
798058 1
798059 7
798060 3
798061 3
798062 2
798063 6
798064 2
798065 9
798066 6

798067 rows × 1 columns

In [244]:
clusters['clusters_general'].value_counts().sort_index()
Out[244]:
0     63235
1     61591
2     84570
3     75084
4     88195
5     88080
6     86591
7    109102
8     82741
9     58878
Name: clusters_general, dtype: int64
In [245]:
fig, ax = plt.subplots(1, 1,figsize=(10, 7))
clusters['clusters_general'].value_counts().sort_index().plot('barh').invert_yaxis() 

# title and axis labels
plt.title('General Population Clusters')
ax.set_ylabel('Cluster')
ax.set_xlabel('Frequency')

# commas for x axis 
fmt = '{x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.xaxis.set_major_formatter(tick)

plt.show()  

Discussion 3.1: Apply Clustering to General Population

(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding clustering. Into how many clusters have you decided to segment the population?)

The number of clusters should be manageable for the marketing campaign purposes and easily explainable to stakeholder and CEO. Therefore, clusters more than 10 are not recommended (although it depends on how the clusters would be used). For day to day marketing operations, up to 10 is okay, more than 10 gets too granular and hard to manage. It's not feasible to have more than 10 marketing approaches.

Somewhat arbitrarily, I'll go with 10 clusters.

I do not see an obvious elbow, but after 10 clusters I do not observe a very big improvement in score. Under 10, especially under 7, there is a big improvement in score.

I also ran K-means with 2 to 25 clusters on only 25% of data vs on all data. The relationship between k and score is approximately the same in both runs (subset vs all data), as see in the graph. So it appears that for this dataset, if you want to choose k, you can just run k means on a subset.

Cluster 7 is the biggest.

Step 3.2: Apply All Steps to the Customer Data

Now that you have clusters and cluster centers for the general population, it's time to see how the customer data maps on to those clusters. Take care to not confuse this for re-fitting all of the models to the customer data. Instead, you're going to use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, you will interpret how the general population fits apply to the customer data.

  • Don't forget when loading in the customers data, that it is semicolon (;) delimited.
  • Apply the same feature wrangling, selection, and engineering steps to the customer demographics using the clean_data() function you created earlier. (You can assume that the customer demographics data has similar meaning behind missing data patterns as the general demographics data.)
  • Use the sklearn objects from the general demographics data, and apply their transformations to the customers data. That is, you should not be using a .fit() or .fit_transform() method to re-fit the old objects, nor should you be creating new sklearn objects! Carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.
In [246]:
# Load in the customer demographics data.
customers = pd.read_csv("Udacity_CUSTOMERS_Subset.csv", sep=';')

Apply preprocessing, feature transformation, and clustering from the general demographics

onto the customer data, obtaining cluster predictions for the customer demographics data.

In [247]:
customers.shape
Out[247]:
(191652, 85)
In [248]:
customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191652 entries, 0 to 191651
Data columns (total 85 columns):
AGER_TYP                 191652 non-null int64
ALTERSKATEGORIE_GROB     191652 non-null int64
ANREDE_KZ                191652 non-null int64
CJT_GESAMTTYP            188439 non-null float64
FINANZ_MINIMALIST        191652 non-null int64
FINANZ_SPARER            191652 non-null int64
FINANZ_VORSORGER         191652 non-null int64
FINANZ_ANLEGER           191652 non-null int64
FINANZ_UNAUFFAELLIGER    191652 non-null int64
FINANZ_HAUSBAUER         191652 non-null int64
FINANZTYP                191652 non-null int64
GEBURTSJAHR              191652 non-null int64
GFK_URLAUBERTYP          188439 non-null float64
GREEN_AVANTGARDE         191652 non-null int64
HEALTH_TYP               191652 non-null int64
LP_LEBENSPHASE_FEIN      188439 non-null float64
LP_LEBENSPHASE_GROB      188439 non-null float64
LP_FAMILIE_FEIN          188439 non-null float64
LP_FAMILIE_GROB          188439 non-null float64
LP_STATUS_FEIN           188439 non-null float64
LP_STATUS_GROB           188439 non-null float64
NATIONALITAET_KZ         191652 non-null int64
PRAEGENDE_JUGENDJAHRE    191652 non-null int64
RETOURTYP_BK_S           188439 non-null float64
SEMIO_SOZ                191652 non-null int64
SEMIO_FAM                191652 non-null int64
SEMIO_REL                191652 non-null int64
SEMIO_MAT                191652 non-null int64
SEMIO_VERT               191652 non-null int64
SEMIO_LUST               191652 non-null int64
SEMIO_ERL                191652 non-null int64
SEMIO_KULT               191652 non-null int64
SEMIO_RAT                191652 non-null int64
SEMIO_KRIT               191652 non-null int64
SEMIO_DOM                191652 non-null int64
SEMIO_KAEM               191652 non-null int64
SEMIO_PFLICHT            191652 non-null int64
SEMIO_TRADV              191652 non-null int64
SHOPPER_TYP              191652 non-null int64
SOHO_KZ                  145056 non-null float64
TITEL_KZ                 145056 non-null float64
VERS_TYP                 191652 non-null int64
ZABEOTYP                 191652 non-null int64
ALTER_HH                 145056 non-null float64
ANZ_PERSONEN             145056 non-null float64
ANZ_TITEL                145056 non-null float64
HH_EINKOMMEN_SCORE       188684 non-null float64
KK_KUNDENTYP             79715 non-null float64
W_KEIT_KIND_HH           137910 non-null float64
WOHNDAUER_2008           145056 non-null float64
ANZ_HAUSHALTE_AKTIV      141725 non-null float64
ANZ_HH_TITEL             139542 non-null float64
GEBAEUDETYP              141725 non-null float64
KONSUMNAEHE              145001 non-null float64
MIN_GEBAEUDEJAHR         141725 non-null float64
OST_WEST_KZ              141725 non-null object
WOHNLAGE                 141725 non-null float64
CAMEO_DEUG_2015          141224 non-null object
CAMEO_DEU_2015           141224 non-null object
CAMEO_INTL_2015          141224 non-null object
KBA05_ANTG1              135672 non-null float64
KBA05_ANTG2              135672 non-null float64
KBA05_ANTG3              135672 non-null float64
KBA05_ANTG4              135672 non-null float64
KBA05_BAUMAX             135672 non-null float64
KBA05_GBZ                135672 non-null float64
BALLRAUM                 141693 non-null float64
EWDICHTE                 141693 non-null float64
INNENSTADT               141693 non-null float64
GEBAEUDETYP_RASTER       141725 non-null float64
KKK                      137392 non-null float64
MOBI_REGIO               135672 non-null float64
ONLINE_AFFINITAET        188439 non-null float64
REGIOTYP                 137392 non-null float64
KBA13_ANZAHL_PKW         140371 non-null float64
PLZ8_ANTG1               138888 non-null float64
PLZ8_ANTG2               138888 non-null float64
PLZ8_ANTG3               138888 non-null float64
PLZ8_ANTG4               138888 non-null float64
PLZ8_BAUMAX              138888 non-null float64
PLZ8_HHZ                 138888 non-null float64
PLZ8_GBZ                 138888 non-null float64
ARBEIT                   141176 non-null float64
ORTSGR_KLS9              141176 non-null float64
RELAT_AB                 141176 non-null float64
dtypes: float64(49), int64(32), object(4)
memory usage: 124.3+ MB
In [249]:
customers.head(10)
Out[249]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GEBURTSJAHR GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ TITEL_KZ VERS_TYP ZABEOTYP ALTER_HH ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE KK_KUNDENTYP W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_BAUMAX KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2 4 1 5.000 5 1 5 1 2 2 2 0 4.000 1 1 20.000 5.000 2.000 2.000 10.000 5.000 1 4 5.000 6 5 2 6 6 7 3 4 1 3 1 1 2 1 3 0.000 0.000 1 3 10.000 2.000 0.000 1.000 nan 6.000 9.000 1.000 0.000 1.000 5.000 1992.000 W 7.000 1 1A 13 2.000 2.000 0.000 0.000 0.000 4.000 3.000 2.000 4.000 4.000 1.000 4.000 3.000 1.000 1201.000 3.000 3.000 1.000 0.000 1.000 5.000 5.000 1.000 2.000 1.000
1 -1 4 1 nan 5 1 5 1 3 2 2 0 nan 0 1 nan nan nan nan nan nan 1 0 nan 3 6 2 6 7 5 3 4 1 3 3 2 4 1 3 0.000 0.000 1 3 11.000 3.000 0.000 nan nan 0.000 9.000 nan nan nan 5.000 nan NaN nan NaN NaN NaN nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
2 -1 4 2 2.000 5 1 5 1 4 4 2 0 3.000 1 2 13.000 3.000 1.000 1.000 10.000 5.000 1 4 5.000 2 2 1 3 3 7 7 1 2 7 5 6 4 1 1 0.000 0.000 2 3 6.000 1.000 0.000 1.000 nan 6.000 9.000 1.000 0.000 8.000 1.000 1992.000 W 2.000 5 5D 34 2.000 2.000 0.000 0.000 0.000 3.000 7.000 4.000 1.000 3.000 3.000 3.000 1.000 7.000 433.000 2.000 3.000 3.000 1.000 3.000 3.000 2.000 3.000 5.000 3.000
3 1 4 1 2.000 5 1 5 2 1 2 6 0 10.000 0 2 0.000 0.000 0.000 0.000 9.000 4.000 1 1 3.000 6 5 3 4 7 5 3 4 3 3 3 3 3 4 0 0.000 0.000 1 1 8.000 0.000 0.000 4.000 nan nan 9.000 0.000 nan 2.000 2.000 1992.000 W 7.000 4 4C 24 3.000 0.000 0.000 0.000 1.000 4.000 7.000 1.000 7.000 4.000 3.000 4.000 2.000 6.000 755.000 3.000 2.000 1.000 0.000 1.000 3.000 4.000 1.000 3.000 1.000
4 -1 3 1 6.000 3 1 4 4 5 2 2 1960 2.000 0 3 31.000 10.000 10.000 5.000 1.000 1.000 1 8 5.000 4 5 4 6 5 6 4 5 5 3 5 2 5 4 1 0.000 0.000 2 1 20.000 4.000 0.000 6.000 2.000 2.000 9.000 7.000 0.000 3.000 1.000 1992.000 W 3.000 7 7B 41 0.000 3.000 2.000 0.000 0.000 3.000 3.000 4.000 4.000 3.000 4.000 3.000 5.000 7.000 513.000 2.000 4.000 2.000 1.000 2.000 3.000 3.000 3.000 5.000 1.000
5 1 3 1 4.000 5 1 5 1 2 3 5 0 11.000 1 3 17.000 5.000 2.000 2.000 7.000 3.000 1 4 3.000 6 4 4 1 7 6 4 6 2 5 5 3 3 4 1 0.000 0.000 2 2 11.000 2.000 0.000 1.000 nan 6.000 9.000 1.000 0.000 1.000 2.000 1992.000 W 1.000 5 5D 34 2.000 2.000 1.000 0.000 0.000 3.000 7.000 5.000 8.000 4.000 2.000 3.000 3.000 3.000 1167.000 2.000 3.000 2.000 1.000 1.000 5.000 5.000 3.000 7.000 5.000
6 2 4 1 2.000 5 1 5 1 1 2 2 1942 10.000 1 2 20.000 5.000 2.000 2.000 10.000 5.000 1 4 5.000 4 2 5 1 6 5 3 4 3 3 1 2 2 4 0 0.000 0.000 1 1 10.000 2.000 0.000 2.000 5.000 6.000 9.000 1.000 0.000 1.000 4.000 1992.000 W 7.000 3 3B 23 4.000 1.000 0.000 0.000 1.000 4.000 6.000 2.000 5.000 4.000 2.000 4.000 4.000 3.000 1300.000 3.000 2.000 1.000 0.000 1.000 5.000 5.000 2.000 3.000 2.000
7 1 4 1 2.000 5 1 5 1 2 2 5 1938 8.000 1 1 20.000 5.000 2.000 2.000 10.000 5.000 1 4 5.000 6 4 5 6 7 7 3 4 1 3 3 1 1 1 3 0.000 0.000 1 3 10.000 2.000 0.000 1.000 5.000 6.000 9.000 1.000 0.000 1.000 3.000 1992.000 W 3.000 1 1D 15 3.000 2.000 0.000 0.000 1.000 4.000 5.000 3.000 5.000 4.000 1.000 4.000 3.000 1.000 481.000 3.000 3.000 1.000 1.000 1.000 3.000 3.000 3.000 4.000 3.000
8 2 4 2 1.000 2 2 5 1 1 5 5 1914 3.000 0 2 6.000 2.000 1.000 1.000 1.000 1.000 1 1 5.000 2 3 1 1 2 7 7 1 3 7 6 6 1 1 3 0.000 0.000 2 3 5.000 1.000 0.000 6.000 4.000 6.000 3.000 74.000 2.000 3.000 1.000 1994.000 W 4.000 9 9E 55 0.000 0.000 0.000 2.000 5.000 1.000 3.000 6.000 1.000 3.000 nan 1.000 2.000 nan 428.000 1.000 4.000 2.000 1.000 5.000 5.000 4.000 3.000 8.000 3.000
9 1 3 1 3.000 5 2 4 1 3 1 2 1959 1.000 1 3 28.000 8.000 8.000 4.000 10.000 5.000 1 9 2.000 6 4 4 1 7 4 4 5 2 3 5 3 3 4 0 0.000 0.000 2 1 20.000 3.000 0.000 1.000 4.000 2.000 9.000 1.000 0.000 1.000 5.000 1997.000 W 5.000 1 1D 15 2.000 2.000 0.000 0.000 0.000 4.000 2.000 5.000 4.000 4.000 2.000 4.000 5.000 6.000 1106.000 3.000 3.000 1.000 0.000 1.000 5.000 5.000 3.000 6.000 4.000
In [250]:
customers.tail(10)
Out[250]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP GEBURTSJAHR GFK_URLAUBERTYP GREEN_AVANTGARDE HEALTH_TYP LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ PRAEGENDE_JUGENDJAHRE RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SHOPPER_TYP SOHO_KZ TITEL_KZ VERS_TYP ZABEOTYP ALTER_HH ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE KK_KUNDENTYP W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL GEBAEUDETYP KONSUMNAEHE MIN_GEBAEUDEJAHR OST_WEST_KZ WOHNLAGE CAMEO_DEUG_2015 CAMEO_DEU_2015 CAMEO_INTL_2015 KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_BAUMAX KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
191642 2 4 2 2.000 2 1 5 2 1 5 6 1937 4.000 0 1 6.000 2.000 1.000 1.000 1.000 1.000 1 3 5.000 3 2 1 2 5 5 7 1 2 6 6 6 2 1 1 0.000 0.000 2 3 9.000 1.000 0.000 6.000 nan 6.000 9.000 32.000 0.000 3.000 3.000 1994.000 O 5.000 8 8D 55 0.000 0.000 0.000 2.000 4.000 1.000 1.000 5.000 1.000 2.000 4.000 1.000 1.000 6.000 298.000 0.000 0.000 0.000 2.000 5.000 3.000 1.000 4.000 9.000 3.000
191643 2 4 1 5.000 5 1 5 1 3 2 5 1947 11.000 1 2 40.000 12.000 11.000 5.000 10.000 5.000 1 6 5.000 6 5 2 4 7 5 7 4 1 3 3 3 2 4 3 0.000 0.000 2 1 18.000 4.000 0.000 1.000 nan 2.000 9.000 1.000 0.000 1.000 5.000 1992.000 W 1.000 1 1D 15 4.000 0.000 0.000 0.000 1.000 4.000 5.000 2.000 5.000 4.000 2.000 5.000 4.000 4.000 1400.000 4.000 2.000 0.000 0.000 1.000 5.000 5.000 2.000 4.000 1.000
191644 2 4 2 6.000 2 1 5 1 2 5 5 1956 3.000 1 2 32.000 10.000 9.000 5.000 3.000 2.000 1 9 2.000 5 3 2 1 2 7 7 1 2 6 5 6 4 1 1 0.000 0.000 2 3 13.000 1.000 0.000 5.000 2.000 4.000 9.000 7.000 0.000 8.000 1.000 1992.000 W 3.000 8 8B 41 0.000 2.000 2.000 0.000 5.000 3.000 7.000 6.000 8.000 3.000 2.000 2.000 5.000 3.000 259.000 1.000 4.000 3.000 1.000 5.000 4.000 3.000 3.000 5.000 5.000
191645 2 4 1 5.000 5 1 5 1 3 2 5 1955 11.000 0 3 36.000 12.000 11.000 5.000 6.000 3.000 1 8 5.000 6 6 2 4 7 7 3 6 3 4 3 2 4 4 3 0.000 0.000 1 6 21.000 7.000 0.000 4.000 5.000 2.000 9.000 2.000 0.000 3.000 1.000 1992.000 W 4.000 6 6B 43 2.000 2.000 0.000 0.000 0.000 3.000 2.000 4.000 5.000 4.000 2.000 4.000 5.000 3.000 485.000 2.000 3.000 2.000 1.000 1.000 3.000 3.000 3.000 5.000 3.000
191646 3 2 2 2.000 2 1 5 1 2 5 5 1957 11.000 0 2 5.000 2.000 1.000 1.000 1.000 1.000 1 8 5.000 2 5 7 3 3 2 5 6 7 7 7 7 6 7 1 0.000 0.000 2 3 13.000 1.000 0.000 5.000 nan 6.000 9.000 5.000 0.000 1.000 3.000 1993.000 W 4.000 1 1C 14 1.000 4.000 1.000 0.000 0.000 3.000 1.000 6.000 2.000 3.000 3.000 3.000 1.000 7.000 68.000 1.000 4.000 2.000 0.000 5.000 1.000 1.000 3.000 9.000 5.000
191647 1 3 1 4.000 5 1 5 1 1 2 5 0 3.000 1 3 9.000 3.000 1.000 1.000 7.000 3.000 1 4 5.000 6 5 4 6 7 6 4 5 5 5 5 3 3 4 1 0.000 0.000 2 3 9.000 1.000 0.000 1.000 nan 6.000 9.000 1.000 0.000 1.000 3.000 1992.000 W 1.000 1 1C 14 3.000 2.000 0.000 0.000 1.000 3.000 1.000 6.000 2.000 5.000 1.000 3.000 3.000 1.000 646.000 2.000 4.000 2.000 1.000 2.000 5.000 4.000 3.000 8.000 5.000
191648 -1 4 2 2.000 5 1 5 2 2 3 2 0 12.000 0 2 0.000 0.000 0.000 0.000 9.000 4.000 1 5 1.000 2 1 2 3 2 7 7 2 3 4 5 5 1 3 3 0.000 0.000 2 3 0.000 0.000 0.000 4.000 nan nan 9.000 0.000 nan 4.000 4.000 1997.000 W 3.000 5 5B 32 1.000 1.000 0.000 0.000 0.000 5.000 6.000 2.000 4.000 5.000 3.000 4.000 2.000 5.000 1600.000 nan nan nan nan nan nan nan 1.000 4.000 1.000
191649 2 4 1 2.000 5 1 5 1 1 2 5 1944 7.000 1 2 40.000 12.000 10.000 5.000 10.000 5.000 1 4 3.000 4 2 3 1 6 5 7 4 1 3 3 3 1 1 0 0.000 0.000 1 3 15.000 3.000 0.000 4.000 nan 3.000 9.000 1.000 0.000 1.000 3.000 1992.000 W 3.000 4 4D 24 4.000 1.000 0.000 0.000 1.000 4.000 7.000 6.000 8.000 4.000 3.000 5.000 2.000 4.000 642.000 3.000 2.000 2.000 1.000 1.000 5.000 5.000 3.000 7.000 5.000
191650 3 3 2 4.000 2 1 5 1 2 5 2 0 8.000 0 2 32.000 10.000 10.000 5.000 3.000 2.000 1 8 5.000 4 2 3 3 3 6 6 3 4 7 7 7 4 2 2 0.000 0.000 1 3 13.000 4.000 0.000 5.000 5.000 4.000 9.000 4.000 0.000 3.000 1.000 1992.000 W 2.000 4 4C 24 2.000 1.000 0.000 0.000 0.000 4.000 3.000 4.000 5.000 3.000 3.000 3.000 2.000 6.000 254.000 3.000 2.000 1.000 1.000 1.000 2.000 3.000 3.000 4.000 4.000
191651 3 2 1 2.000 5 1 5 1 1 2 6 1937 1.000 0 2 38.000 12.000 10.000 5.000 9.000 4.000 1 3 3.000 7 7 6 7 6 3 1 7 4 5 4 4 7 5 1 0.000 0.000 1 1 0.000 3.000 0.000 5.000 4.000 3.000 9.000 1.000 0.000 1.000 5.000 1992.000 W 7.000 5 5C 33 1.000 1.000 0.000 0.000 0.000 5.000 3.000 1.000 6.000 5.000 2.000 5.000 1.000 4.000 1145.000 3.000 2.000 0.000 0.000 1.000 4.000 5.000 1.000 3.000 1.000
In [251]:
# clean and encode
customers_cleaned_encoded = clean_data(customers)
Threshold of missing rows in the row:  30

Total nrow: 191652

Few Missing Values. Will be kept.
141725

Lots of Missing Values. Will be deleted
49927

% of rows with lots of missing values. % of data deleted 
0.260508630226
{40: [1, 2], 50: [3, 4], 60: [5, 6, 7], 70: [8, 9], 80: [10, 11, 12, 13], 90: [14, 15]}
PRAEGENDE_JUGENDJAHRE_DECADE     40     50     60     70    80    90
PRAEGENDE_JUGENDJAHRE                                               
1.000                          9931      0      0      0     0     0
2.000                         11316      0      0      0     0     0
3.000                             0  19534      0      0     0     0
4.000                             0  22216      0      0     0     0
5.000                             0      0  17167      0     0     0
6.000                             0      0  15457      0     0     0
7.000                             0      0    864      0     0     0
8.000                             0      0      0  14217     0     0
9.000                             0      0      0  11133     0     0
10.000                            0      0      0      0  5033     0
11.000                            0      0      0      0  6246     0
12.000                            0      0      0      0   714     0
13.000                            0      0      0      0   587     0
14.000                            0      0      0      0     0  3454
15.000                            0      0      0      0     0  2550
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE      0      1
PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP                    
                                            1306      0
AVANTGARDE                                     0  70369
MAINSTREAM                                 70050      0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM      0      1
PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP                    
                                            1306      0
AVANTGARDE                                 70369      0
MAINSTREAM                                     0  70050
In [252]:
customers_cleaned_encoded.shape
Out[252]:
(141725, 70)
In [253]:
customers_cleaned_encoded.head()
Out[253]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
0 4.000 1.000 5.000 1.000 5.000 1.000 2.000 2.000 1.000 1.000 5.000 6.000 5.000 2.000 6.000 6.000 7.000 3.000 4.000 1.000 3.000 1.000 1.000 2.000 1.000 0.000 1.000 2.000 0.000 1.000 6.000 9.000 1.000 0.000 5.000 1992.000 2.000 2.000 0.000 0.000 4.000 3.000 2.000 4.000 4.000 1.000 4.000 3.000 1.000 1201.000 3.000 3.000 1.000 0.000 5.000 5.000 1.000 2.000 1.000 0 50 1 0 1 3 0 1 1.000 1 0.000
2 4.000 2.000 5.000 1.000 5.000 1.000 4.000 4.000 1.000 2.000 5.000 2.000 2.000 1.000 3.000 3.000 7.000 7.000 1.000 2.000 7.000 5.000 6.000 4.000 1.000 0.000 2.000 1.000 0.000 1.000 6.000 9.000 1.000 0.000 1.000 1992.000 2.000 2.000 0.000 0.000 3.000 7.000 4.000 1.000 3.000 3.000 3.000 1.000 7.000 433.000 2.000 3.000 3.000 1.000 3.000 2.000 3.000 5.000 3.000 0 50 1 0 3 4 0 1 3.000 0 2.000
3 4.000 1.000 5.000 1.000 5.000 2.000 1.000 2.000 0.000 2.000 3.000 6.000 5.000 3.000 4.000 7.000 5.000 3.000 4.000 3.000 3.000 3.000 3.000 3.000 4.000 0.000 1.000 0.000 0.000 4.000 nan 9.000 nan nan 2.000 1992.000 3.000 0.000 0.000 0.000 4.000 7.000 1.000 7.000 4.000 3.000 4.000 2.000 6.000 755.000 3.000 2.000 1.000 0.000 3.000 4.000 1.000 3.000 1.000 0 40 0 1 2 4 0 1 1.000 1 0.000
4 3.000 1.000 3.000 1.000 4.000 4.000 5.000 2.000 0.000 3.000 5.000 4.000 5.000 4.000 6.000 5.000 6.000 4.000 5.000 5.000 3.000 5.000 2.000 5.000 4.000 0.000 2.000 4.000 0.000 6.000 2.000 9.000 7.000 0.000 1.000 1992.000 0.000 3.000 2.000 0.000 3.000 3.000 4.000 4.000 3.000 4.000 3.000 5.000 7.000 513.000 2.000 4.000 2.000 1.000 3.000 3.000 3.000 5.000 1.000 0 70 0 1 4 1 0 1 2.000 0 3.000
5 3.000 1.000 5.000 1.000 5.000 1.000 2.000 3.000 1.000 3.000 3.000 6.000 4.000 4.000 1.000 7.000 6.000 4.000 6.000 2.000 5.000 5.000 3.000 3.000 4.000 0.000 2.000 2.000 0.000 1.000 6.000 9.000 1.000 0.000 2.000 1992.000 2.000 2.000 1.000 0.000 3.000 7.000 5.000 8.000 4.000 2.000 3.000 3.000 3.000 1167.000 2.000 3.000 2.000 1.000 5.000 5.000 3.000 7.000 5.000 0 50 1 0 3 4 0 1 1.000 0 1.000
10% of data in the general population was dropped due to lots (>30) of missing values in rows. 25% of customers data was dropped. This is important to note. What does it mean? Does it mean we know less about our customers than on avg about the general population?
Our assumption was that customers data is sort of like general population data. The proportion of rows with large amount of NANs is one thing I've found how they're different. There may be more differences.
In [254]:
customers_cleaned_encoded.isnull().sum()
Out[254]:
ALTERSKATEGORIE_GROB                           233
ANREDE_KZ                                        0
FINANZ_MINIMALIST                                0
FINANZ_SPARER                                    0
FINANZ_VORSORGER                                 0
FINANZ_ANLEGER                                   0
FINANZ_UNAUFFAELLIGER                            0
FINANZ_HAUSBAUER                                 0
GREEN_AVANTGARDE                                 0
HEALTH_TYP                                    2339
RETOURTYP_BK_S                                3124
SEMIO_SOZ                                        0
SEMIO_FAM                                        0
SEMIO_REL                                        0
SEMIO_MAT                                        0
SEMIO_VERT                                       0
SEMIO_LUST                                       0
SEMIO_ERL                                        0
SEMIO_KULT                                       0
SEMIO_RAT                                        0
SEMIO_KRIT                                       0
SEMIO_DOM                                        0
SEMIO_KAEM                                       0
SEMIO_PFLICHT                                    0
SEMIO_TRADV                                      0
SOHO_KZ                                          0
VERS_TYP                                      2339
ANZ_PERSONEN                                     0
ANZ_TITEL                                        0
HH_EINKOMMEN_SCORE                               0
W_KEIT_KIND_HH                                7939
WOHNDAUER_2008                                   0
ANZ_HAUSHALTE_AKTIV                           2450
ANZ_HH_TITEL                                  2183
KONSUMNAEHE                                      6
MIN_GEBAEUDEJAHR                                 0
KBA05_ANTG1                                   6053
KBA05_ANTG2                                   6053
KBA05_ANTG3                                   6053
KBA05_ANTG4                                   6053
KBA05_GBZ                                     6055
BALLRAUM                                        32
EWDICHTE                                        32
INNENSTADT                                      32
GEBAEUDETYP_RASTER                               0
KKK                                          10137
MOBI_REGIO                                    6053
ONLINE_AFFINITAET                             3124
REGIOTYP                                     10137
KBA13_ANZAHL_PKW                              1354
PLZ8_ANTG1                                    2837
PLZ8_ANTG2                                    2837
PLZ8_ANTG3                                    2837
PLZ8_ANTG4                                    2837
PLZ8_HHZ                                      2837
PLZ8_GBZ                                      2837
ARBEIT                                         572
ORTSGR_KLS9                                    549
RELAT_AB                                       572
OST_WEST_KZ_O                                    0
PRAEGENDE_JUGENDJAHRE_DECADE                     0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE        0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM        0
CAMEO_INTL_2015_WEALTH                         627
CAMEO_INTL_2015_LIFE_STAGE_TYP                 627
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS                  0
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY                    0
PLZ8_BAUMAX_FAMILY_HOMES                      2837
WOHNLAGE_RURAL_FLAG                              0
WOHNLAGE_CITY_NEIGHBOURHOOD                      0
dtype: int64
In [255]:
# number of columns with missing values
# this number was 34 for azdias as far as i remember
customers_cleaned_encoded.isnull().any().sum()
Out[255]:
33
In [256]:
# Imputer produces an array, not a dataframe. I'll lose my columns I'll need to rebuild the df.
# So I need to save my column names
columns_list_customers = list(customers_cleaned_encoded.columns)
columns_list_customers
Out[256]:
['ALTERSKATEGORIE_GROB',
 'ANREDE_KZ',
 'FINANZ_MINIMALIST',
 'FINANZ_SPARER',
 'FINANZ_VORSORGER',
 'FINANZ_ANLEGER',
 'FINANZ_UNAUFFAELLIGER',
 'FINANZ_HAUSBAUER',
 'GREEN_AVANTGARDE',
 'HEALTH_TYP',
 'RETOURTYP_BK_S',
 'SEMIO_SOZ',
 'SEMIO_FAM',
 'SEMIO_REL',
 'SEMIO_MAT',
 'SEMIO_VERT',
 'SEMIO_LUST',
 'SEMIO_ERL',
 'SEMIO_KULT',
 'SEMIO_RAT',
 'SEMIO_KRIT',
 'SEMIO_DOM',
 'SEMIO_KAEM',
 'SEMIO_PFLICHT',
 'SEMIO_TRADV',
 'SOHO_KZ',
 'VERS_TYP',
 'ANZ_PERSONEN',
 'ANZ_TITEL',
 'HH_EINKOMMEN_SCORE',
 'W_KEIT_KIND_HH',
 'WOHNDAUER_2008',
 'ANZ_HAUSHALTE_AKTIV',
 'ANZ_HH_TITEL',
 'KONSUMNAEHE',
 'MIN_GEBAEUDEJAHR',
 'KBA05_ANTG1',
 'KBA05_ANTG2',
 'KBA05_ANTG3',
 'KBA05_ANTG4',
 'KBA05_GBZ',
 'BALLRAUM',
 'EWDICHTE',
 'INNENSTADT',
 'GEBAEUDETYP_RASTER',
 'KKK',
 'MOBI_REGIO',
 'ONLINE_AFFINITAET',
 'REGIOTYP',
 'KBA13_ANZAHL_PKW',
 'PLZ8_ANTG1',
 'PLZ8_ANTG2',
 'PLZ8_ANTG3',
 'PLZ8_ANTG4',
 'PLZ8_HHZ',
 'PLZ8_GBZ',
 'ARBEIT',
 'ORTSGR_KLS9',
 'RELAT_AB',
 'OST_WEST_KZ_O',
 'PRAEGENDE_JUGENDJAHRE_DECADE',
 'PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE',
 'PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM',
 'CAMEO_INTL_2015_WEALTH',
 'CAMEO_INTL_2015_LIFE_STAGE_TYP',
 'PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS',
 'PLZ8_BAUMAX_BLDNG_TYPE_FAMILY',
 'PLZ8_BAUMAX_FAMILY_HOMES',
 'WOHNLAGE_RURAL_FLAG',
 'WOHNLAGE_CITY_NEIGHBOURHOOD']
In [257]:
# impute
customers_cleaned_encoded_imputed = imputer.transform(customers_cleaned_encoded)

print(type(customers_cleaned_encoded_imputed))

customers_cleaned_encoded_imputed = pd.DataFrame(customers_cleaned_encoded_imputed,columns=columns_list_customers)

print(type(customers_cleaned_encoded_imputed))

customers_cleaned_encoded_imputed.head()
<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
Out[257]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
0 4.000 1.000 5.000 1.000 5.000 1.000 2.000 2.000 1.000 1.000 5.000 6.000 5.000 2.000 6.000 6.000 7.000 3.000 4.000 1.000 3.000 1.000 1.000 2.000 1.000 0.000 1.000 2.000 0.000 1.000 6.000 9.000 1.000 0.000 5.000 1992.000 2.000 2.000 0.000 0.000 4.000 3.000 2.000 4.000 4.000 1.000 4.000 3.000 1.000 1201.000 3.000 3.000 1.000 0.000 5.000 5.000 1.000 2.000 1.000 0.000 50.000 1.000 0.000 1.000 3.000 0.000 1.000 1.000 1.000 0.000
1 4.000 2.000 5.000 1.000 5.000 1.000 4.000 4.000 1.000 2.000 5.000 2.000 2.000 1.000 3.000 3.000 7.000 7.000 1.000 2.000 7.000 5.000 6.000 4.000 1.000 0.000 2.000 1.000 0.000 1.000 6.000 9.000 1.000 0.000 1.000 1992.000 2.000 2.000 0.000 0.000 3.000 7.000 4.000 1.000 3.000 3.000 3.000 1.000 7.000 433.000 2.000 3.000 3.000 1.000 3.000 2.000 3.000 5.000 3.000 0.000 50.000 1.000 0.000 3.000 4.000 0.000 1.000 3.000 0.000 2.000
2 4.000 1.000 5.000 1.000 5.000 2.000 1.000 2.000 0.000 2.000 3.000 6.000 5.000 3.000 4.000 7.000 5.000 3.000 4.000 3.000 3.000 3.000 3.000 3.000 4.000 0.000 1.000 0.000 0.000 4.000 6.000 9.000 1.000 0.000 2.000 1992.000 3.000 0.000 0.000 0.000 4.000 7.000 1.000 7.000 4.000 3.000 4.000 2.000 6.000 755.000 3.000 2.000 1.000 0.000 3.000 4.000 1.000 3.000 1.000 0.000 40.000 0.000 1.000 2.000 4.000 0.000 1.000 1.000 1.000 0.000
3 3.000 1.000 3.000 1.000 4.000 4.000 5.000 2.000 0.000 3.000 5.000 4.000 5.000 4.000 6.000 5.000 6.000 4.000 5.000 5.000 3.000 5.000 2.000 5.000 4.000 0.000 2.000 4.000 0.000 6.000 2.000 9.000 7.000 0.000 1.000 1992.000 0.000 3.000 2.000 0.000 3.000 3.000 4.000 4.000 3.000 4.000 3.000 5.000 7.000 513.000 2.000 4.000 2.000 1.000 3.000 3.000 3.000 5.000 1.000 0.000 70.000 0.000 1.000 4.000 1.000 0.000 1.000 2.000 0.000 3.000
4 3.000 1.000 5.000 1.000 5.000 1.000 2.000 3.000 1.000 3.000 3.000 6.000 4.000 4.000 1.000 7.000 6.000 4.000 6.000 2.000 5.000 5.000 3.000 3.000 4.000 0.000 2.000 2.000 0.000 1.000 6.000 9.000 1.000 0.000 2.000 1992.000 2.000 2.000 1.000 0.000 3.000 7.000 5.000 8.000 4.000 2.000 3.000 3.000 3.000 1167.000 2.000 3.000 2.000 1.000 5.000 5.000 3.000 7.000 5.000 0.000 50.000 1.000 0.000 3.000 4.000 0.000 1.000 1.000 0.000 1.000
In [258]:
# we now should have no NANs
customers_cleaned_encoded_imputed.isnull().any().sum()
Out[258]:
0
In [259]:
customers_cleaned_encoded_imputed.isnull().sum()
Out[259]:
ALTERSKATEGORIE_GROB                         0
ANREDE_KZ                                    0
FINANZ_MINIMALIST                            0
FINANZ_SPARER                                0
FINANZ_VORSORGER                             0
FINANZ_ANLEGER                               0
FINANZ_UNAUFFAELLIGER                        0
FINANZ_HAUSBAUER                             0
GREEN_AVANTGARDE                             0
HEALTH_TYP                                   0
RETOURTYP_BK_S                               0
SEMIO_SOZ                                    0
SEMIO_FAM                                    0
SEMIO_REL                                    0
SEMIO_MAT                                    0
SEMIO_VERT                                   0
SEMIO_LUST                                   0
SEMIO_ERL                                    0
SEMIO_KULT                                   0
SEMIO_RAT                                    0
SEMIO_KRIT                                   0
SEMIO_DOM                                    0
SEMIO_KAEM                                   0
SEMIO_PFLICHT                                0
SEMIO_TRADV                                  0
SOHO_KZ                                      0
VERS_TYP                                     0
ANZ_PERSONEN                                 0
ANZ_TITEL                                    0
HH_EINKOMMEN_SCORE                           0
W_KEIT_KIND_HH                               0
WOHNDAUER_2008                               0
ANZ_HAUSHALTE_AKTIV                          0
ANZ_HH_TITEL                                 0
KONSUMNAEHE                                  0
MIN_GEBAEUDEJAHR                             0
KBA05_ANTG1                                  0
KBA05_ANTG2                                  0
KBA05_ANTG3                                  0
KBA05_ANTG4                                  0
KBA05_GBZ                                    0
BALLRAUM                                     0
EWDICHTE                                     0
INNENSTADT                                   0
GEBAEUDETYP_RASTER                           0
KKK                                          0
MOBI_REGIO                                   0
ONLINE_AFFINITAET                            0
REGIOTYP                                     0
KBA13_ANZAHL_PKW                             0
PLZ8_ANTG1                                   0
PLZ8_ANTG2                                   0
PLZ8_ANTG3                                   0
PLZ8_ANTG4                                   0
PLZ8_HHZ                                     0
PLZ8_GBZ                                     0
ARBEIT                                       0
ORTSGR_KLS9                                  0
RELAT_AB                                     0
OST_WEST_KZ_O                                0
PRAEGENDE_JUGENDJAHRE_DECADE                 0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE    0
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM    0
CAMEO_INTL_2015_WEALTH                       0
CAMEO_INTL_2015_LIFE_STAGE_TYP               0
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS              0
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY                0
PLZ8_BAUMAX_FAMILY_HOMES                     0
WOHNLAGE_RURAL_FLAG                          0
WOHNLAGE_CITY_NEIGHBOURHOOD                  0
dtype: int64
In [260]:
# checking how mode imputation worked

print(customers_cleaned_encoded['KKK'].isna().sum())
print(customers_cleaned_encoded['KKK'].value_counts())

print(customers_cleaned_encoded_imputed['KKK'].isna().sum())
print(customers_cleaned_encoded_imputed['KKK'].value_counts())
10137
3.000    40739
2.000    40049
1.000    28850
4.000    21950
Name: KKK, dtype: int64
0
3.000    50876
2.000    40049
1.000    28850
4.000    21950
Name: KKK, dtype: int64
In [261]:
# 'KKK'
# 10137+40739 = 50876
10137+40739
Out[261]:
50876
In [262]:
# checking how mode imputation worked

print(customers_cleaned_encoded['ALTERSKATEGORIE_GROB'].isna().sum())
print(customers_cleaned_encoded['ALTERSKATEGORIE_GROB'].value_counts())

print(customers_cleaned_encoded_imputed['ALTERSKATEGORIE_GROB'].isna().sum())
print(customers_cleaned_encoded_imputed['ALTERSKATEGORIE_GROB'].value_counts())
233
4.000    83996
3.000    47377
2.000     5153
1.000     4966
Name: ALTERSKATEGORIE_GROB, dtype: int64
0
4.000    83996
3.000    47610
2.000     5153
1.000     4966
Name: ALTERSKATEGORIE_GROB, dtype: int64
We didn't refit the imputer to customers data. So ALTERSKATEGORIE_GROB were imputed with not 4, mode for customers, but 3, mode for general population. Now I understand better what they meant by the following

I use the sklearn objects from the general demographics data, and apply their transformations to the customers data. That is, we should not be using a .fit() or .fit_transform() method to re-fit the old objects, nor should we be creating new sklearn objects! We should carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.

In [263]:
233 + 47377
Out[263]:
47610
In [264]:
azdias_cleaned_encoded.ALTERSKATEGORIE_GROB.value_counts()
Out[264]:
3.000    310466
4.000    223265
2.000    137100
1.000    124433
Name: ALTERSKATEGORIE_GROB, dtype: int64
In [265]:
customers_cleaned_encoded.ALTERSKATEGORIE_GROB.value_counts()
Out[265]:
4.000    83996
3.000    47377
2.000     5153
1.000     4966
Name: ALTERSKATEGORIE_GROB, dtype: int64
In [266]:
# scale

customers_scaled = scaler.transform(customers_cleaned_encoded_imputed)

print(type(customers_scaled))

customers_scaled = pd.DataFrame(customers_scaled, columns=columns_list_customers)

print(type(customers_scaled))

customers_scaled.head()
<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
Out[266]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
0 1.184 -1.044 1.409 -1.156 1.138 -1.250 -0.470 -0.791 1.885 -1.634 1.064 0.958 0.463 -1.044 1.104 0.885 1.268 -0.888 -0.067 -1.751 -0.880 -1.946 -1.764 -1.178 -1.543 -0.092 -1.083 0.234 -0.060 -2.209 0.957 0.567 -0.467 -0.125 1.275 -0.383 0.413 0.641 -0.595 -0.464 0.655 -0.529 -1.129 -0.271 0.283 -1.853 0.779 0.167 -1.988 1.620 0.787 0.212 -0.623 -0.935 1.456 1.486 -2.175 -1.433 -1.526 -0.517 -1.045 1.885 -1.707 -1.551 0.095 -0.373 0.422 -0.305 1.809 -1.455
1 1.184 0.958 1.409 -1.156 1.138 -1.250 0.959 0.629 1.885 -0.312 1.064 -1.102 -1.104 -1.567 -0.464 -0.657 1.268 1.302 -1.600 -1.146 1.392 0.244 0.913 -0.100 -1.543 -0.092 0.923 -0.630 -0.060 -2.209 0.957 0.567 -0.467 -0.125 -1.305 -0.383 0.413 0.641 -0.595 -0.464 -0.116 1.303 0.034 -1.750 -0.800 0.270 0.093 -1.120 1.329 -0.567 -0.257 0.212 1.429 0.443 -0.616 -1.249 -0.171 -0.127 -0.052 -0.517 -1.045 1.885 -1.707 -0.188 0.767 -0.373 0.422 1.685 -0.553 -0.239
2 1.184 -1.044 1.409 -1.156 1.138 -0.571 -1.185 -0.791 -0.530 -0.312 -0.311 0.958 0.463 -0.521 0.059 1.399 0.317 -0.888 -0.067 -0.541 -0.880 -0.851 -0.693 -0.639 0.156 -0.092 -1.083 -1.495 -0.060 -0.268 0.957 0.567 -0.467 -0.125 -0.660 -0.383 1.124 -0.966 -0.595 -0.464 0.655 1.303 -1.710 1.208 0.283 0.270 0.779 -0.476 0.776 0.350 0.787 -0.890 -0.623 -0.935 -0.616 0.574 -2.175 -0.998 -1.526 -0.517 -1.550 -0.530 0.586 -0.870 0.767 -0.373 0.422 -0.305 1.809 -1.455
3 0.201 -1.044 -0.043 -1.156 0.412 0.787 1.673 -0.791 -0.530 1.010 1.064 -0.072 0.463 0.003 1.104 0.371 0.792 -0.341 0.443 0.669 -0.880 0.244 -1.228 0.440 0.156 -0.092 0.923 1.964 -0.060 1.027 -1.293 0.567 -0.083 -0.125 -1.305 -0.383 -1.009 1.445 1.411 -0.464 -0.116 -0.529 0.034 -0.271 -0.800 1.331 0.093 1.454 1.329 -0.339 -0.257 1.314 0.403 0.443 -0.616 -0.337 -0.171 -0.127 -1.526 -0.517 -0.034 -0.530 0.586 0.494 -1.249 -0.373 0.422 0.690 -0.553 0.369
4 0.201 -1.044 1.409 -1.156 1.138 -1.250 -0.470 -0.081 1.885 1.010 -0.311 0.958 -0.059 0.003 -1.509 1.399 0.792 -0.341 0.954 -1.146 0.256 0.244 -0.693 -0.639 0.156 -0.092 0.923 0.234 -0.060 -2.209 0.957 0.567 -0.467 -0.125 -0.660 -0.383 0.413 0.641 0.408 -0.464 -0.116 1.303 0.616 1.701 0.283 -0.792 0.093 0.167 -0.882 1.523 -0.257 0.212 0.403 0.443 1.456 1.486 -0.171 0.743 1.422 -0.517 -1.045 1.885 -1.707 -0.188 0.767 -0.373 0.422 -0.305 -0.553 -0.847
In [267]:
customers_cleaned_encoded_imputed.describe().transpose()
Out[267]:
count mean std min 25% 50% 75% max
ALTERSKATEGORIE_GROB 141725.000 3.486 0.730 1.000 3.000 4.000 4.000 4.000
ANREDE_KZ 141725.000 1.331 0.470 1.000 1.000 1.000 2.000 2.000
FINANZ_MINIMALIST 141725.000 4.264 1.018 1.000 4.000 5.000 5.000 5.000
FINANZ_SPARER 141725.000 1.431 0.834 1.000 1.000 1.000 2.000 5.000
FINANZ_VORSORGER 141725.000 4.563 0.846 1.000 4.000 5.000 5.000 5.000
FINANZ_ANLEGER 141725.000 1.602 1.003 1.000 1.000 1.000 2.000 5.000
FINANZ_UNAUFFAELLIGER 141725.000 1.866 0.979 1.000 1.000 2.000 2.000 5.000
FINANZ_HAUSBAUER 141725.000 2.728 1.317 1.000 2.000 2.000 4.000 5.000
GREEN_AVANTGARDE 141725.000 0.497 0.500 0.000 0.000 0.000 1.000 1.000
HEALTH_TYP 141725.000 1.979 0.778 1.000 1.000 2.000 3.000 3.000
RETOURTYP_BK_S 141725.000 3.941 1.183 1.000 3.000 4.000 5.000 5.000
SEMIO_SOZ 141725.000 4.328 1.619 1.000 3.000 4.000 6.000 7.000
SEMIO_FAM 141725.000 3.894 1.683 1.000 2.000 4.000 5.000 7.000
SEMIO_REL 141725.000 3.188 1.460 1.000 2.000 3.000 4.000 7.000
SEMIO_MAT 141725.000 3.488 1.982 1.000 1.000 4.000 6.000 7.000
SEMIO_VERT 141725.000 5.194 1.761 1.000 4.000 6.000 7.000 7.000
SEMIO_LUST 141725.000 5.516 1.635 1.000 4.000 6.000 7.000 7.000
SEMIO_ERL 141725.000 4.871 1.752 1.000 3.000 4.000 7.000 7.000
SEMIO_KULT 141725.000 3.913 1.753 1.000 3.000 4.000 5.000 7.000
SEMIO_RAT 141725.000 2.889 1.401 1.000 2.000 3.000 4.000 7.000
SEMIO_KRIT 141725.000 3.915 1.779 1.000 3.000 3.000 5.000 7.000
SEMIO_DOM 141725.000 4.000 1.592 1.000 3.000 4.000 5.000 7.000
SEMIO_KAEM 141725.000 3.598 1.794 1.000 2.000 3.000 5.000 7.000
SEMIO_PFLICHT 141725.000 3.031 1.432 1.000 2.000 3.000 4.000 7.000
SEMIO_TRADV 141725.000 2.907 1.393 1.000 2.000 3.000 4.000 7.000
SOHO_KZ 141725.000 0.010 0.099 0.000 0.000 0.000 0.000 1.000
VERS_TYP 141725.000 1.502 0.500 1.000 1.000 2.000 2.000 2.000
ANZ_PERSONEN 141725.000 2.269 1.391 0.000 1.000 2.000 3.000 21.000
ANZ_TITEL 141725.000 0.020 0.152 0.000 0.000 0.000 0.000 5.000
HH_EINKOMMEN_SCORE 141725.000 3.254 1.655 1.000 2.000 3.000 5.000 6.000
W_KEIT_KIND_HH 141725.000 4.366 1.873 1.000 2.000 6.000 6.000 6.000
WOHNDAUER_2008 141725.000 8.647 1.153 1.000 9.000 9.000 9.000 9.000
ANZ_HAUSHALTE_AKTIV 141725.000 4.983 14.304 1.000 1.000 1.000 4.000 523.000
ANZ_HH_TITEL 141725.000 0.066 0.541 0.000 0.000 0.000 0.000 23.000
KONSUMNAEHE 141725.000 3.138 1.440 1.000 2.000 3.000 4.000 7.000
MIN_GEBAEUDEJAHR 141725.000 1993.057 3.080 1985.000 1992.000 1992.000 1992.000 2016.000
KBA05_ANTG1 141725.000 2.113 1.425 0.000 1.000 2.000 3.000 4.000
KBA05_ANTG2 141725.000 1.157 1.091 0.000 0.000 1.000 2.000 4.000
KBA05_ANTG3 141725.000 0.297 0.751 0.000 0.000 0.000 0.000 3.000
KBA05_ANTG4 141725.000 0.147 0.468 0.000 0.000 0.000 0.000 2.000
KBA05_GBZ 141725.000 3.604 1.150 1.000 3.000 4.000 5.000 5.000
BALLRAUM 141725.000 4.302 2.115 1.000 2.000 5.000 6.000 7.000
EWDICHTE 141725.000 3.882 1.608 1.000 2.000 4.000 5.000 6.000
INNENSTADT 141725.000 4.785 1.961 1.000 3.000 5.000 6.000 8.000
GEBAEUDETYP_RASTER 141725.000 3.853 0.830 1.000 3.000 4.000 4.000 5.000
KKK 141725.000 2.465 0.983 1.000 2.000 3.000 3.000 4.000
MOBI_REGIO 141725.000 3.515 1.363 1.000 3.000 4.000 5.000 6.000
ONLINE_AFFINITAET 141725.000 2.997 1.352 0.000 2.000 3.000 4.000 5.000
REGIOTYP 141725.000 4.127 1.955 1.000 2.000 5.000 6.000 7.000
KBA13_ANZAHL_PKW 141725.000 674.232 346.267 5.000 432.000 596.000 838.000 2300.000
PLZ8_ANTG1 141725.000 2.527 0.899 0.000 2.000 3.000 3.000 4.000
PLZ8_ANTG2 141725.000 2.737 0.833 0.000 2.000 3.000 3.000 4.000
PLZ8_ANTG3 141725.000 1.401 0.880 0.000 1.000 1.000 2.000 3.000
PLZ8_ANTG4 141725.000 0.529 0.635 0.000 0.000 0.000 1.000 2.000
PLZ8_HHZ 141725.000 3.622 0.929 1.000 3.000 3.000 4.000 5.000
PLZ8_GBZ 141725.000 3.610 1.002 1.000 3.000 4.000 4.000 5.000
ARBEIT 141725.000 2.829 1.010 1.000 2.000 3.000 4.000 5.000
ORTSGR_KLS9 141725.000 5.119 2.155 1.000 4.000 5.000 7.000 9.000
RELAT_AB 141725.000 2.898 1.418 1.000 2.000 3.000 4.000 5.000
OST_WEST_KZ_O 141725.000 0.080 0.271 0.000 0.000 0.000 0.000 1.000
PRAEGENDE_JUGENDJAHRE_DECADE 141725.000 58.338 14.538 0.000 50.000 60.000 70.000 90.000
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE 141725.000 0.497 0.500 0.000 0.000 0.000 1.000 1.000
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM 141725.000 0.494 0.500 0.000 0.000 0.000 1.000 1.000
CAMEO_INTL_2015_WEALTH 141725.000 2.609 1.411 1.000 1.000 2.000 4.000 5.000
CAMEO_INTL_2015_LIFE_STAGE_TYP 141725.000 3.358 1.347 1.000 2.000 4.000 4.000 5.000
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS 141725.000 0.072 0.258 0.000 0.000 0.000 0.000 1.000
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY 141725.000 0.908 0.289 0.000 1.000 1.000 1.000 1.000
PLZ8_BAUMAX_FAMILY_HOMES 141725.000 1.187 0.755 0.000 1.000 1.000 1.000 4.000
WOHNLAGE_RURAL_FLAG 141725.000 0.237 0.425 0.000 0.000 0.000 0.000 1.000
WOHNLAGE_CITY_NEIGHBOURHOOD 141725.000 2.050 1.487 0.000 1.000 2.000 3.000 5.000
In [268]:
# notice how scaling didn't do its job here
customers_scaled.describe().transpose()
Out[268]:
count mean std min 25% 50% 75% max
ALTERSKATEGORIE_GROB 141725.000 0.679 0.718 -1.767 0.201 1.184 1.184 1.184
ANREDE_KZ 141725.000 -0.382 0.942 -1.044 -1.044 -1.044 0.958 0.958
FINANZ_MINIMALIST 141725.000 0.875 0.739 -1.495 0.683 1.409 1.409 1.409
FINANZ_SPARER 141725.000 -0.865 0.562 -1.156 -1.156 -1.156 -0.482 1.538
FINANZ_VORSORGER 141725.000 0.821 0.615 -1.767 0.412 1.138 1.138 1.138
FINANZ_ANLEGER 141725.000 -0.841 0.681 -1.250 -1.250 -1.250 -0.571 1.466
FINANZ_UNAUFFAELLIGER 141725.000 -0.566 0.699 -1.185 -1.185 -0.470 -0.470 1.673
FINANZ_HAUSBAUER 141725.000 -0.274 0.935 -1.501 -0.791 -0.791 0.629 1.339
GREEN_AVANTGARDE 141725.000 0.669 1.208 -0.530 -0.530 -0.530 1.885 1.885
HEALTH_TYP 141725.000 -0.339 1.028 -1.634 -1.634 -0.312 1.010 1.010
RETOURTYP_BK_S 141725.000 0.336 0.813 -1.685 -0.311 0.376 1.064 1.064
SEMIO_SOZ 141725.000 0.097 0.834 -1.618 -0.587 -0.072 0.958 1.474
SEMIO_FAM 141725.000 -0.115 0.880 -1.627 -1.104 -0.059 0.463 1.508
SEMIO_REL 141725.000 -0.422 0.764 -1.567 -1.044 -0.521 0.003 1.573
SEMIO_MAT 141725.000 -0.209 1.036 -1.509 -1.509 0.059 1.104 1.627
SEMIO_VERT 141725.000 0.471 0.905 -1.685 -0.143 0.885 1.399 1.399
SEMIO_LUST 141725.000 0.562 0.777 -1.585 -0.159 0.792 1.268 1.268
SEMIO_ERL 141725.000 0.136 0.959 -1.983 -0.888 -0.341 1.302 1.302
SEMIO_KULT 141725.000 -0.112 0.895 -1.600 -0.578 -0.067 0.443 1.465
SEMIO_RAT 141725.000 -0.608 0.848 -1.751 -1.146 -0.541 0.064 1.879
SEMIO_KRIT 141725.000 -0.360 1.011 -2.016 -0.880 -0.880 0.256 1.392
SEMIO_DOM 141725.000 -0.303 0.872 -1.946 -0.851 -0.304 0.244 1.339
SEMIO_KAEM 141725.000 -0.373 0.960 -1.764 -1.228 -0.693 0.378 1.449
SEMIO_PFLICHT 141725.000 -0.622 0.772 -1.718 -1.178 -0.639 -0.100 1.519
SEMIO_TRADV 141725.000 -0.463 0.789 -1.543 -0.977 -0.410 0.156 1.855
SOHO_KZ 141725.000 0.016 1.081 -0.092 -0.092 -0.092 -0.092 10.854
VERS_TYP 141725.000 -0.077 1.003 -1.083 -1.083 0.923 0.923 0.923
ANZ_PERSONEN 141725.000 0.467 1.203 -1.495 -0.630 0.234 1.099 16.663
ANZ_TITEL 141725.000 0.235 2.210 -0.060 -0.060 -0.060 -0.060 72.522
HH_EINKOMMEN_SCORE 141725.000 -0.750 1.071 -2.209 -1.562 -0.915 0.380 1.027
W_KEIT_KIND_HH 141725.000 0.038 1.054 -1.855 -1.293 0.957 0.957 0.957
WOHNDAUER_2008 141725.000 0.384 0.599 -3.593 0.567 0.567 0.567 0.567
ANZ_HAUSHALTE_AKTIV 141725.000 -0.212 0.916 -0.467 -0.467 -0.467 -0.275 32.943
ANZ_HH_TITEL 141725.000 0.080 1.675 -0.125 -0.125 -0.125 -0.125 71.026
KONSUMNAEHE 141725.000 0.074 0.929 -1.305 -0.660 -0.015 0.630 2.565
MIN_GEBAEUDEJAHR 141725.000 -0.066 0.924 -2.484 -0.383 -0.383 -0.383 6.819
KBA05_ANTG1 141725.000 0.493 1.013 -1.009 -0.298 0.413 1.124 1.835
KBA05_ANTG2 141725.000 -0.036 0.876 -0.966 -0.966 -0.162 0.641 2.248
KBA05_ANTG3 141725.000 -0.297 0.754 -0.595 -0.595 -0.595 -0.595 2.414
KBA05_ANTG4 141725.000 -0.230 0.747 -0.464 -0.464 -0.464 -0.464 2.731
KBA05_GBZ 141725.000 0.349 0.887 -1.659 -0.116 0.655 1.427 1.427
BALLRAUM 141725.000 0.068 0.968 -1.445 -0.987 0.387 0.845 1.303
EWDICHTE 141725.000 -0.034 0.935 -1.710 -1.129 0.034 0.616 1.198
INNENSTADT 141725.000 0.116 0.967 -1.750 -0.764 0.222 0.715 1.701
GEBAEUDETYP_RASTER 141725.000 0.124 0.899 -2.966 -0.800 0.283 0.283 1.367
KKK 141725.000 -0.298 1.043 -1.853 -0.792 0.270 0.270 1.331
MOBI_REGIO 141725.000 0.446 0.935 -1.280 0.093 0.779 1.465 2.152
ONLINE_AFFINITAET 141725.000 0.165 0.870 -1.763 -0.476 0.167 0.810 1.454
REGIOTYP 141725.000 -0.260 1.081 -1.988 -1.435 0.223 0.776 1.329
KBA13_ANZAHL_PKW 141725.000 0.120 0.986 -1.786 -0.570 -0.103 0.587 4.750
PLZ8_ANTG1 141725.000 0.293 0.938 -2.343 -0.257 0.787 0.787 1.830
PLZ8_ANTG2 141725.000 -0.078 0.918 -3.094 -0.890 0.212 0.212 1.314
PLZ8_ANTG3 141725.000 -0.212 0.903 -1.649 -0.623 -0.623 0.403 1.429
PLZ8_ANTG4 141725.000 -0.206 0.875 -0.935 -0.935 -0.935 0.443 1.820
PLZ8_HHZ 141725.000 0.028 0.962 -2.689 -0.616 -0.616 0.420 1.456
PLZ8_GBZ 141725.000 0.219 0.913 -2.160 -0.337 0.574 0.574 1.486
ARBEIT 141725.000 -0.343 1.012 -2.175 -1.173 -0.171 0.830 1.832
ORTSGR_KLS9 141725.000 -0.075 0.938 -1.868 -0.562 -0.127 0.743 1.614
RELAT_AB 141725.000 -0.127 1.045 -1.526 -0.789 -0.052 0.685 1.422
OST_WEST_KZ_O 141725.000 -0.321 0.665 -0.517 -0.517 -0.517 -0.517 1.933
PRAEGENDE_JUGENDJAHRE_DECADE 141725.000 -0.624 0.735 -3.572 -1.045 -0.540 -0.034 0.977
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE 141725.000 0.669 1.208 -0.530 -0.530 -0.530 1.885 1.885
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM 141725.000 -0.573 1.146 -1.707 -1.707 -1.707 0.586 0.586
CAMEO_INTL_2015_WEALTH 141725.000 -0.454 0.962 -1.551 -1.551 -0.870 0.494 1.176
CAMEO_INTL_2015_LIFE_STAGE_TYP 141725.000 0.335 0.905 -1.249 -0.577 0.767 0.767 1.439
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS 141725.000 -0.153 0.789 -0.373 -0.373 -0.373 -0.373 2.683
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY 141725.000 0.166 0.806 -2.369 0.422 0.422 0.422 0.422
PLZ8_BAUMAX_FAMILY_HOMES 141725.000 -0.119 0.751 -1.300 -0.305 -0.305 -0.305 2.680
WOHNLAGE_RURAL_FLAG 141725.000 0.007 1.004 -0.553 -0.553 -0.553 -0.553 1.809
WOHNLAGE_CITY_NEIGHBOURHOOD 141725.000 -0.209 0.904 -1.455 -0.847 -0.239 0.369 1.585

I know we were told to NOT use fit_transform.

However, the idea behind StandardScaler is to normalize variables so they have a mean of 0 and stdev of 1.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

https://stackoverflow.com/questions/40758562/can-anyone-explain-me-standardscaler

If I just use transform without fitting, this doesn't work on customer data, it doesn't get properly normalized to mean of 0 and stdev of 1.

I will diverge from the instructions here and refit the existing scaler to the customer data, so it's normalized properly. Or else I'm worried k-means won't work correctly.

I get the idea is to build a pipeline and to run customer data thru the pipeline, but I'm concerned by the fact that scaling doesn't appear to have worked properly in this case for customers data.

I will still run K-means after "transform" or "fit_transform" and see if clusters are affected by it.

Customer data did get closer to 0 and 1 though after scaling it with general population scaler.

In [269]:
# this code is just like the one a few cells above, 
# but I"m re-fitting scaler to customer data to normalize it to mean of 0 and stdev of 1
# scale

"""
customers_scaled was obtained thru transform (using general population scaler)
customers_scaled_1 was obtained thru fit_transform (using customer scaler)
"""

customers_scaled_1 = scaler.fit_transform(customers_cleaned_encoded_imputed)

print(type(customers_scaled_1))

customers_scaled_1 = pd.DataFrame(customers_scaled_1, columns=columns_list_customers)

print(type(customers_scaled_1))

customers_scaled_1.head()
<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
Out[269]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
0 0.704 -0.703 0.723 -0.517 0.516 -0.601 0.137 -0.553 1.007 -1.259 0.895 1.033 0.657 -0.814 1.267 0.458 0.908 -1.068 0.049 -1.348 -0.514 -1.885 -1.449 -0.720 -1.369 -0.100 -1.003 -0.193 -0.134 -1.362 0.872 0.306 -0.278 -0.123 1.292 -0.343 -0.079 0.773 -0.395 -0.314 0.345 -0.616 -1.171 -0.400 0.178 -1.491 0.356 0.002 -1.599 1.521 0.527 0.316 -0.456 -0.833 1.484 1.387 -1.810 -1.447 -1.339 -0.295 -0.574 1.007 -0.989 -1.141 -0.266 -0.278 0.318 -0.247 1.795 -1.378
1 0.704 1.423 0.723 -0.517 0.516 -0.601 2.181 0.966 1.007 0.027 0.895 -1.438 -1.125 -1.499 -0.246 -1.246 0.908 1.215 -1.662 -0.635 1.734 0.628 1.339 0.677 -1.369 -0.100 0.997 -0.912 -0.134 -1.362 0.872 0.306 -0.278 -0.123 -1.484 -0.343 -0.079 0.773 -0.395 -0.314 -0.525 1.276 0.073 -1.930 -1.027 0.544 -0.378 -1.477 1.469 -0.697 -0.586 0.316 1.816 0.741 -0.670 -1.606 0.170 -0.055 0.072 -0.295 -0.574 1.007 -0.989 0.277 0.477 -0.278 0.318 2.401 -0.557 -0.033
2 0.704 -0.703 0.723 -0.517 0.516 0.397 -0.885 -0.553 -0.993 0.027 -0.796 1.033 0.657 -0.129 0.258 1.025 -0.316 -1.068 0.049 0.079 -0.514 -0.628 -0.334 -0.022 0.785 -0.100 -1.003 -1.631 -0.134 0.451 0.872 0.306 -0.278 -0.123 -0.790 -0.343 0.622 -1.061 -0.395 -0.314 0.345 1.276 -1.793 1.130 0.178 0.544 0.356 -0.737 0.958 0.233 0.527 -0.884 -0.456 -0.833 -0.670 0.389 -1.810 -0.983 -1.339 -0.295 -1.261 -0.993 1.012 -0.432 0.477 -0.278 0.318 -0.247 1.795 -1.378
3 -0.666 -0.703 -1.241 -0.517 -0.665 2.392 3.202 -0.553 -0.993 1.312 0.895 -0.202 0.657 0.556 1.267 -0.110 0.296 -0.497 0.620 1.506 -0.514 0.628 -0.891 1.375 0.785 -0.100 0.997 1.244 -0.134 1.659 -1.263 0.306 0.141 -0.123 -1.484 -0.343 -1.482 1.690 2.267 -0.314 -0.525 -0.616 0.073 -0.400 -1.027 1.562 -0.378 1.482 1.469 -0.466 -0.586 1.516 0.680 0.741 -0.670 -0.608 0.170 -0.055 -1.339 -0.295 0.802 -0.993 1.012 0.986 -1.751 -0.278 0.318 1.077 -0.557 0.639
4 -0.666 -0.703 0.723 -0.517 0.516 -0.601 0.137 0.206 1.007 1.312 -0.796 1.033 0.063 0.556 -1.255 1.025 0.296 -0.497 1.191 -0.635 0.610 0.628 -0.334 -0.022 0.785 -0.100 0.997 -0.193 -0.134 -1.362 0.872 0.306 -0.278 -0.123 -0.790 -0.343 -0.079 0.773 0.936 -0.314 -0.525 1.276 0.695 1.639 0.178 -0.473 -0.378 0.002 -0.576 1.423 -0.586 0.316 0.680 0.741 1.484 1.387 0.170 0.873 1.483 -0.295 -0.574 1.007 -0.989 0.277 0.477 -0.278 0.318 -0.247 -0.557 -0.706
In [270]:
# now it's scaled nicely to mean of 0 and stdev of 1
customers_scaled_1.describe().transpose()
Out[270]:
count mean std min 25% 50% 75% max
ALTERSKATEGORIE_GROB 141725.000 0.000 1.000 -3.406 -0.666 0.704 0.704 0.704
ANREDE_KZ 141725.000 0.000 1.000 -0.703 -0.703 -0.703 1.423 1.423
FINANZ_MINIMALIST 141725.000 -0.000 1.000 -3.206 -0.259 0.723 0.723 0.723
FINANZ_SPARER 141725.000 0.000 1.000 -0.517 -0.517 -0.517 0.681 4.277
FINANZ_VORSORGER 141725.000 0.000 1.000 -4.210 -0.665 0.516 0.516 0.516
FINANZ_ANLEGER 141725.000 0.000 1.000 -0.601 -0.601 -0.601 0.397 3.389
FINANZ_UNAUFFAELLIGER 141725.000 0.000 1.000 -0.885 -0.885 0.137 0.137 3.202
FINANZ_HAUSBAUER 141725.000 0.000 1.000 -1.313 -0.553 -0.553 0.966 1.725
GREEN_AVANTGARDE 141725.000 0.000 1.000 -0.993 -0.993 -0.993 1.007 1.007
HEALTH_TYP 141725.000 -0.000 1.000 -1.259 -1.259 0.027 1.312 1.312
RETOURTYP_BK_S 141725.000 -0.000 1.000 -2.486 -0.796 0.050 0.895 0.895
SEMIO_SOZ 141725.000 -0.000 1.000 -2.055 -0.820 -0.202 1.033 1.651
SEMIO_FAM 141725.000 -0.000 1.000 -1.719 -1.125 0.063 0.657 1.845
SEMIO_REL 141725.000 -0.000 1.000 -1.499 -0.814 -0.129 0.556 2.611
SEMIO_MAT 141725.000 -0.000 1.000 -1.255 -1.255 0.258 1.267 1.772
SEMIO_VERT 141725.000 0.000 1.000 -2.382 -0.678 0.458 1.025 1.025
SEMIO_LUST 141725.000 0.000 1.000 -2.763 -0.927 0.296 0.908 0.908
SEMIO_ERL 141725.000 -0.000 1.000 -2.209 -1.068 -0.497 1.215 1.215
SEMIO_KULT 141725.000 0.000 1.000 -1.662 -0.521 0.049 0.620 1.761
SEMIO_RAT 141725.000 -0.000 1.000 -1.348 -0.635 0.079 0.793 2.933
SEMIO_KRIT 141725.000 0.000 1.000 -1.639 -0.514 -0.514 0.610 1.734
SEMIO_DOM 141725.000 -0.000 1.000 -1.885 -0.628 -0.000 0.628 1.884
SEMIO_KAEM 141725.000 -0.000 1.000 -1.449 -0.891 -0.334 0.782 1.897
SEMIO_PFLICHT 141725.000 -0.000 1.000 -1.419 -0.720 -0.022 0.677 2.772
SEMIO_TRADV 141725.000 -0.000 1.000 -1.369 -0.651 0.067 0.785 2.939
SOHO_KZ 141725.000 -0.000 1.000 -0.100 -0.100 -0.100 -0.100 10.030
VERS_TYP 141725.000 0.000 1.000 -1.003 -1.003 0.997 0.997 0.997
ANZ_PERSONEN 141725.000 -0.000 1.000 -1.631 -0.912 -0.193 0.525 13.462
ANZ_TITEL 141725.000 -0.000 1.000 -0.134 -0.134 -0.134 -0.134 32.712
HH_EINKOMMEN_SCORE 141725.000 0.000 1.000 -1.362 -0.758 -0.154 1.055 1.659
W_KEIT_KIND_HH 141725.000 0.000 1.000 -1.796 -1.263 0.872 0.872 0.872
WOHNDAUER_2008 141725.000 -0.000 1.000 -6.635 0.306 0.306 0.306 0.306
ANZ_HAUSHALTE_AKTIV 141725.000 -0.000 1.000 -0.278 -0.278 -0.278 -0.069 36.214
ANZ_HH_TITEL 141725.000 -0.000 1.000 -0.123 -0.123 -0.123 -0.123 42.358
KONSUMNAEHE 141725.000 -0.000 1.000 -1.484 -0.790 -0.096 0.598 2.681
MIN_GEBAEUDEJAHR 141725.000 -0.000 1.000 -2.616 -0.343 -0.343 -0.343 7.449
KBA05_ANTG1 141725.000 0.000 1.000 -1.482 -0.781 -0.079 0.622 1.324
KBA05_ANTG2 141725.000 0.000 1.000 -1.061 -1.061 -0.144 0.773 2.607
KBA05_ANTG3 141725.000 0.000 1.000 -0.395 -0.395 -0.395 -0.395 3.598
KBA05_ANTG4 141725.000 -0.000 1.000 -0.314 -0.314 -0.314 -0.314 3.962
KBA05_GBZ 141725.000 -0.000 1.000 -2.263 -0.525 0.345 1.214 1.214
BALLRAUM 141725.000 -0.000 1.000 -1.562 -1.089 0.330 0.803 1.276
EWDICHTE 141725.000 -0.000 1.000 -1.793 -1.171 0.073 0.695 1.317
INNENSTADT 141725.000 -0.000 1.000 -1.930 -0.910 0.110 0.620 1.639
GEBAEUDETYP_RASTER 141725.000 0.000 1.000 -3.436 -1.027 0.178 0.178 1.382
KKK 141725.000 0.000 1.000 -1.491 -0.473 0.544 0.544 1.562
MOBI_REGIO 141725.000 0.000 1.000 -1.846 -0.378 0.356 1.090 1.824
ONLINE_AFFINITAET 141725.000 0.000 1.000 -2.216 -0.737 0.002 0.742 1.482
REGIOTYP 141725.000 0.000 1.000 -1.599 -1.088 0.447 0.958 1.469
KBA13_ANZAHL_PKW 141725.000 -0.000 1.000 -1.933 -0.700 -0.226 0.473 4.695
PLZ8_ANTG1 141725.000 -0.000 1.000 -2.811 -0.586 0.527 0.527 1.639
PLZ8_ANTG2 141725.000 0.000 1.000 -3.284 -0.884 0.316 0.316 1.516
PLZ8_ANTG3 141725.000 0.000 1.000 -1.592 -0.456 -0.456 0.680 1.816
PLZ8_ANTG4 141725.000 0.000 1.000 -0.833 -0.833 -0.833 0.741 2.316
PLZ8_HHZ 141725.000 0.000 1.000 -2.824 -0.670 -0.670 0.407 1.484
PLZ8_GBZ 141725.000 -0.000 1.000 -2.604 -0.608 0.389 0.389 1.387
ARBEIT 141725.000 0.000 1.000 -1.810 -0.820 0.170 1.160 2.150
ORTSGR_KLS9 141725.000 -0.000 1.000 -1.911 -0.519 -0.055 0.873 1.801
RELAT_AB 141725.000 -0.000 1.000 -1.339 -0.633 0.072 0.777 1.483
OST_WEST_KZ_O 141725.000 0.000 1.000 -0.295 -0.295 -0.295 -0.295 3.390
PRAEGENDE_JUGENDJAHRE_DECADE 141725.000 0.000 1.000 -4.013 -0.574 0.114 0.802 2.178
PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE 141725.000 0.000 1.000 -0.993 -0.993 -0.993 1.007 1.007
PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM 141725.000 0.000 1.000 -0.989 -0.989 -0.989 1.012 1.012
CAMEO_INTL_2015_WEALTH 141725.000 -0.000 1.000 -1.141 -1.141 -0.432 0.986 1.695
CAMEO_INTL_2015_LIFE_STAGE_TYP 141725.000 0.000 1.000 -1.751 -1.008 0.477 0.477 1.220
PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS 141725.000 -0.000 1.000 -0.278 -0.278 -0.278 -0.278 3.597
PLZ8_BAUMAX_BLDNG_TYPE_FAMILY 141725.000 -0.000 1.000 -3.146 0.318 0.318 0.318 0.318
PLZ8_BAUMAX_FAMILY_HOMES 141725.000 0.000 1.000 -1.572 -0.247 -0.247 -0.247 3.726
WOHNLAGE_RURAL_FLAG 141725.000 -0.000 1.000 -0.557 -0.557 -0.557 -0.557 1.795
WOHNLAGE_CITY_NEIGHBOURHOOD 141725.000 0.000 1.000 -1.378 -0.706 -0.033 0.639 1.984
In [271]:
pca
Out[271]:
PCA(copy=True, iterated_power='auto', n_components=23, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)
In [272]:
def cluster_customer_data(my_customer_scaled_data, my_graph_title):
    """do PCA, k means prediction 
    data scaled with general population scaler with transform
    """
    # PCA
    customers_pca = pca.transform(my_customer_scaled_data)
    print(type(customers_pca))
    print()
    print('customers_pca')
    print(customers_pca)
    print()

    # predict with kmeans

    preds_customers = model_10.predict(customers_pca)
    print()
    print('preds_customers')
    print(preds_customers)
    print()

    # make a df with clusters
    clusters_customers = pd.DataFrame ({'clusters_customers' : preds_customers})
    print('clusters_customers')
    print(clusters_customers.head(30))

    print()
    print('clusters_customers summary')
    print(clusters_customers['clusters_customers'].value_counts().sort_index())


    #visualize
    fig, ax = plt.subplots(1, 1,figsize=(10, 7))
    clusters_customers['clusters_customers'].value_counts().sort_index().plot('barh').invert_yaxis() 

    # title and axis labels
    plt.title(my_graph_title)
    ax.set_ylabel('Cluster')
    ax.set_xlabel('Frequency')

    # commas for x axis 
    fmt = '{x:,.0f}'
    tick = mtick.StrMethodFormatter(fmt)
    ax.xaxis.set_major_formatter(tick)

    plt.show()  
    
    return clusters_customers
In [273]:
clusters_customers = cluster_customer_data(customers_scaled, 
                                           'CUSTOMER CLUSTERS - SCALED WITH GENERAL POPULATION SCALER WITH TRANSFORM')
<class 'numpy.ndarray'>

customers_pca
[[ -6.11349846e+00   1.26367633e+00   4.51624615e+00 ...,   5.82684947e-01
    6.68465743e-01  -2.87627831e-01]
 [ -9.85886812e-01   5.04474508e+00  -3.08658723e-01 ...,  -9.13001259e-01
   -8.52958642e-01  -5.54500696e-01]
 [ -4.62294421e+00   4.20880109e-01   2.46878937e+00 ...,  -7.95304746e-01
   -3.72791245e-01  -5.38535422e-03]
 ..., 
 [ -3.88545211e+00   3.16708754e+00   2.84906982e+00 ...,  -8.08336604e-01
   -2.26155234e-01   3.74865189e-02]
 [ -7.71574165e-01   2.85994775e+00  -1.94431550e+00 ...,  -4.49853496e-01
   -2.77239660e-01   1.22606848e-01]
 [ -4.27730615e+00  -2.68666526e+00   1.98565480e+00 ...,   7.98519018e-01
    9.33190636e-01  -5.50849392e-02]]


preds_customers
[7 7 4 ..., 7 8 4]

clusters_customers
    clusters_customers
0                    7
1                    7
2                    4
3                    6
4                    7
5                    7
6                    7
7                    1
8                    7
9                    7
10                   4
11                   4
12                   5
13                   7
14                   7
15                   7
16                   9
17                   8
18                   7
19                   4
20                   8
21                   1
22                   8
23                   7
24                   8
25                   7
26                   5
27                   1
28                   7
29                   7

clusters_customers summary
0     3362
1     8262
2     2291
3     1188
4    27217
5    11724
6    13889
7    61217
8    11492
9     1083
Name: clusters_customers, dtype: int64
In [274]:
clusters_customers = cluster_customer_data(customers_scaled_1, 
                                           'CUSTOMER CLUSTERS - SCALED WITH CUSTOMER SCALER WITH FIT_TRANSFORM')
<class 'numpy.ndarray'>

customers_pca
[[-4.23746511  0.16041729  3.26813252 ...,  0.67741733  0.75784593
  -0.43335095]
 [ 1.24245661  4.07498237 -2.05255157 ..., -1.00470502 -1.01046761
  -0.56929711]
 [-2.83842733 -0.95240616  1.11145631 ..., -0.66595892 -0.28169006
  -0.12803516]
 ..., 
 [-1.92260104  2.18188353  1.47168361 ..., -0.7474713  -0.32467245
  -0.01361919]
 [ 1.41360388  1.51919681 -3.77835715 ..., -0.28953571 -0.32031142
  -0.01535212]
 [-2.33839872 -4.69659083  0.57456595 ...,  0.94959892  1.21053172
  -0.25498035]]


preds_customers
[4 5 4 ..., 7 5 0]

clusters_customers
    clusters_customers
0                    4
1                    5
2                    4
3                    3
4                    6
5                    4
6                    7
7                    1
8                    0
9                    0
10                   4
11                   6
12                   5
13                   7
14                   7
15                   0
16                   9
17                   8
18                   7
19                   6
20                   2
21                   1
22                   8
23                   2
24                   8
25                   9
26                   5
27                   1
28                   8
29                   0

clusters_customers summary
0    11738
1     8880
2    10590
3     5742
4    24800
5    14979
6    21735
7    27279
8    11099
9     4883
Name: clusters_customers, dtype: int64
The below screenshot demonstrates how sensitive k means is to properly scaling the data. If the data is not correclty scaled, that changes the distances between points, affects clusters and produces a misleading result. I will go with the option on the right: scaling with fit_transform (contrary to what the task asked for). The results with fit_transform seem more conservative, less different from the general population. I believe the result on the left is not correct. It's produced by k means run on data which was not properly scaled to mean 0 and stdev 1 (because we didn't refit the scaler to customer data).

image.png

image.png

Step 3.3: Compare Customer Data to Demographics Data

At this point, you have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, you will compare the two cluster distributions to see where the strongest customer base for the company is.

Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If we think the company's customer base to be universal, then the cluster assignment proportions should be fairly similar between the two. If there are only particular segments of the population that are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid) then that suggests the people in that cluster to be a target audience for the company. On the other hand, the proportion of the data in a cluster being larger in the general population than the customer data (e.g. only 2% of customers closest to a population centroid that captures 6% of the data) suggests that group of persons to be outside of the target demographics.

Take a look at the following points in this step:

  • Compute the proportion of data points in each cluster for the general population and the customer data. Visualizations will be useful here: both for the individual dataset proportions, but also to visualize the ratios in cluster representation between groups. Seaborn's countplot() or barplot() function could be handy.
    • Recall the analysis you performed in step 1.1.3 of the project, where you separated out certain data points from the dataset if they had more than a specified threshold of missing values. If you found that this group was qualitatively different from the main bulk of the data, you should treat this as an additional data cluster in this analysis. Make sure that you account for the number of data points in this subset, for both the general population and customer datasets, when making your computations!
  • Which cluster or clusters are overrepresented in the customer dataset compared to the general population? Select at least one such cluster and infer what kind of people might be represented by that cluster. Use the principal component interpretations from step 2.3 or look at additional components to help you make this inference. Alternatively, you can use the .inverse_transform() method of the PCA and StandardScaler objects to transform centroids back to the original data space and interpret the retrieved values directly.
  • Perform a similar investigation for the underrepresented clusters. Which cluster or clusters are underrepresented in the customer dataset compared to the general population, and what kinds of people are typified by these clusters?
In [275]:
# this is just testing. Again, as described above, "transform" doesn't work well. "fit_transform" works fine


# keep this one commented out, dont use for analysis
# clusters_customers = cluster_customer_data(customers_scaled, 
#                                            'CUSTOMER CLUSTERS - SCALED WITH GENERAL POPULATION SCALER WITH TRANSFORM')

# use this one for analysis
# clusters_customers = cluster_customer_data(customers_scaled_1, 
#                                            'CUSTOMER CLUSTERS - SCALED WITH CUSTOMER SCALER WITH FIT_TRANSFORM')
In [276]:
# too many missing values in general population

print(len(azdias))
print(len(azdias_cleaned_encoded))
azdias_too_many_NA_in_rows = len(azdias) - len(azdias_cleaned_encoded)
print(azdias_too_many_NA_in_rows)
891221
798067
93154
In [277]:
# too many missing values in customers
print(len(customers))
print(len(customers_cleaned_encoded))
customers_too_many_NA_in_rows = len(customers) - len(customers_cleaned_encoded)
print(customers_too_many_NA_in_rows)
191652
141725
49927
In [278]:
# general pop clusters
summary_general = clusters['clusters_general'].value_counts().sort_index()
summary_general
Out[278]:
0     63235
1     61591
2     84570
3     75084
4     88195
5     88080
6     86591
7    109102
8     82741
9     58878
Name: clusters_general, dtype: int64
In [279]:
# customer clusters
summary_customers = clusters_customers.clusters_customers.value_counts().sort_index()
summary_customers
Out[279]:
0    11738
1     8880
2    10590
3     5742
4    24800
5    14979
6    21735
7    27279
8    11099
9     4883
Name: clusters_customers, dtype: int64
In [280]:
# create a dataframe with general population and customer clusters
# https://stackoverflow.com/questions/18062135/combining-two-series-into-a-dataframe-in-pandas
summary_comparison = pd.concat([summary_general, summary_customers], axis =1)
print(type(summary_comparison))
summary_comparison.reset_index(inplace=True)
summary_comparison.columns = ['cluster','general','customers']
summary_comparison
<class 'pandas.core.frame.DataFrame'>
Out[280]:
cluster general customers
0 0 63235 11738
1 1 61591 8880
2 2 84570 10590
3 3 75084 5742
4 4 88195 24800
5 5 88080 14979
6 6 86591 21735
7 7 109102 27279
8 8 82741 11099
9 9 58878 4883
In [281]:
# visualize gen pop and cust clusters

fig, ax = plt.subplots(1, 2,figsize=(15, 7)) 

plt.subplot(1,2,1)  

plt.bar(summary_comparison['cluster'], summary_comparison['general'], align='center', alpha=0.5)

plt.xlabel('Clusters', fontsize=14)
plt.ylabel('People', fontsize=14)

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
title = 'General Population Clusters'
plt.title(title,  fontsize=18)


plt.subplot(1,2,2) 
plt.bar(summary_comparison['cluster'], summary_comparison['customers'], align='center', alpha=0.5)

plt.xlabel('Clusters', fontsize=14)
plt.ylabel('People', fontsize=14)

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)


title = 'Customer Clusters'
plt.title(title,  fontsize=18)
plt.show()
In [282]:
# add clusters which are data with lots of missing rows
# https://thispointer.com/python-pandas-how-to-add-rows-in-a-dataframe-using-dataframe-append-loc-iloc/
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
summary_comparison_with_NAN_cluster = summary_comparison.append({'cluster' : -1, 
                                                'general': azdias_too_many_NA_in_rows, 
                                                'customers':customers_too_many_NA_in_rows},
                                              ignore_index=True).set_index('cluster', drop=False)
summary_comparison_with_NAN_cluster
Out[282]:
cluster general customers
cluster
0 0 63235 11738
1 1 61591 8880
2 2 84570 10590
3 3 75084 5742
4 4 88195 24800
5 5 88080 14979
6 6 86591 21735
7 7 109102 27279
8 8 82741 11099
9 9 58878 4883
-1 -1 93154 49927
In [283]:
# visualize gen pop and cust clusters again. This time with NAN / null cluster

fig, ax = plt.subplots(1, 2,figsize=(15, 7)) 

plt.subplot(1,2,1)  

plt.bar(summary_comparison_with_NAN_cluster['cluster'], summary_comparison_with_NAN_cluster['general'], align='center', alpha=0.5)

plt.xlabel('Clusters', fontsize=14)
plt.ylabel('People', fontsize=14)

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
title = 'General Population Clusters. -1 is lots of NAN.'
plt.title(title,  fontsize=18)


plt.subplot(1,2,2) 
plt.bar(summary_comparison_with_NAN_cluster['cluster'], summary_comparison_with_NAN_cluster['customers'], align='center', alpha=0.5)

plt.xlabel('Clusters', fontsize=14)
plt.ylabel('People', fontsize=14)

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
title = 'Customer Clusters. -1 is lots of NAN.'
plt.title(title,  fontsize=18)
plt.show()
In [284]:
# all clusters for gen pop and cust, incuding NAN 
summary_comparison_with_NAN_cluster
Out[284]:
cluster general customers
cluster
0 0 63235 11738
1 1 61591 8880
2 2 84570 10590
3 3 75084 5742
4 4 88195 24800
5 5 88080 14979
6 6 86591 21735
7 7 109102 27279
8 8 82741 11099
9 9 58878 4883
-1 -1 93154 49927
In [285]:
# https://stackoverflow.com/questions/43217916/pandas-data-precision
# show only two decimal points
pd.set_option('display.float_format', '{:0.4f}'.format)
In [286]:
# gen_perc & cust_perc show proportion of clusters
# diff is difference

summary_comparison_with_NAN_cluster['gen_perc'] = summary_comparison_with_NAN_cluster['general'] / len(azdias)
summary_comparison_with_NAN_cluster['cust_perc'] = summary_comparison_with_NAN_cluster['customers'] / len(customers)
summary_comparison_with_NAN_cluster['diff'] =summary_comparison_with_NAN_cluster['cust_perc']- summary_comparison_with_NAN_cluster['gen_perc']

summary_comparison_with_NAN_cluster
Out[286]:
cluster general customers gen_perc cust_perc diff
cluster
0 0 63235 11738 0.0710 0.0612 -0.0097
1 1 61591 8880 0.0691 0.0463 -0.0228
2 2 84570 10590 0.0949 0.0553 -0.0396
3 3 75084 5742 0.0842 0.0300 -0.0543
4 4 88195 24800 0.0990 0.1294 0.0304
5 5 88080 14979 0.0988 0.0782 -0.0207
6 6 86591 21735 0.0972 0.1134 0.0162
7 7 109102 27279 0.1224 0.1423 0.0199
8 8 82741 11099 0.0928 0.0579 -0.0349
9 9 58878 4883 0.0661 0.0255 -0.0406
-1 -1 93154 49927 0.1045 0.2605 0.1560

Compare the proportion of data in each cluster for the customer data to the proportion of data in each cluster for the general population.

In [287]:
print("Clusters from most likely ones to be target audience to least likely ones")
summary_comparison_with_NAN_cluster.sort_values(by='diff',ascending= False)
# 4 is the most likely to be target audience
# 3 is outside of the target audience
Clusters from most likely ones to be target audience to least likely ones
Out[287]:
cluster general customers gen_perc cust_perc diff
cluster
-1 -1 93154 49927 0.1045 0.2605 0.1560
4 4 88195 24800 0.0990 0.1294 0.0304
7 7 109102 27279 0.1224 0.1423 0.0199
6 6 86591 21735 0.0972 0.1134 0.0162
0 0 63235 11738 0.0710 0.0612 -0.0097
5 5 88080 14979 0.0988 0.0782 -0.0207
1 1 61591 8880 0.0691 0.0463 -0.0228
8 8 82741 11099 0.0928 0.0579 -0.0349
2 2 84570 10590 0.0949 0.0553 -0.0396
9 9 58878 4883 0.0661 0.0255 -0.0406
3 3 75084 5742 0.0842 0.0300 -0.0543
In [288]:
# https://stackoverflow.com/questions/42128467/matplotlib-plot-multiple-columns-of-pandas-data-frame-on-the-bar-chart
# https://stackoverflow.com/questions/51174691/how-to-increase-image-size-of-pandas-dataframe-plot-in-jupyter-notebook/51174822


summary_comparison_with_NAN_cluster.plot(x='cluster', y=['gen_perc', 'cust_perc'], 
                                         kind="bar",
                                         figsize=(10,5),
                                         alpha=0.5, 
                                         color=['#00aad2', '#ff7900'])


plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Proportion', fontsize=14)

plt.xticks(fontsize=15, rotation=0)
plt.yticks(fontsize=15)
title = 'Clusters Proportion: Gen Pop vs Customers'
plt.title(title,  fontsize=18)


plt.show()

What kinds of people are part of a cluster that is overrepresented in the customer data compared to the general population? - See the discussion at the end of the project

In [289]:
print('Overrepresented in customers')
summary_comparison_with_NAN_cluster[summary_comparison_with_NAN_cluster['diff']>0].sort_values(by='diff',ascending= False)
Overrepresented in customers
Out[289]:
cluster general customers gen_perc cust_perc diff
cluster
-1 -1 93154 49927 0.1045 0.2605 0.1560
4 4 88195 24800 0.0990 0.1294 0.0304
7 7 109102 27279 0.1224 0.1423 0.0199
6 6 86591 21735 0.0972 0.1134 0.0162
In [290]:
summary_comparison_with_NAN_cluster[summary_comparison_with_NAN_cluster['diff']>0].sort_values(by='diff',
                                                                                               ascending= False).plot(
    x='cluster', y=['gen_perc', 'cust_perc'], 
                                         kind="bar",
                                         figsize=(10,5),
                                         alpha=0.5, 
                                         color=['#00aad2', '#ff7900'])


plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Proportion', fontsize=14)

plt.xticks(fontsize=15, rotation=0)
plt.yticks(fontsize=15)
title = 'Overrepresented in Customers'
plt.title(title,  fontsize=18)


plt.show()

What kinds of people are part of a cluster that is underrepresented in the customer data compared to the general population? - See the discussion at the end of the project

In [291]:
summary_comparison_with_NAN_cluster[summary_comparison_with_NAN_cluster['diff']<0].sort_values(by='diff',ascending= True).plot(
    x='cluster', y=['gen_perc', 'cust_perc'], 
                                         kind="bar",
                                         figsize=(10,5),
                                         alpha=0.5, 
                                         color=['#00aad2', '#ff7900'])


plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Proportion', fontsize=14)

plt.xticks(fontsize=15, rotation=0)
plt.yticks(fontsize=15)
title = 'Underrepresented in Customers'
plt.title(title,  fontsize=18)


plt.show()
In [292]:
print('Underrepresented in customers')
summary_comparison_with_NAN_cluster[summary_comparison_with_NAN_cluster['diff']<0].sort_values(by='diff',ascending= False)
Underrepresented in customers
Out[292]:
cluster general customers gen_perc cust_perc diff
cluster
0 0 63235 11738 0.0710 0.0612 -0.0097
5 5 88080 14979 0.0988 0.0782 -0.0207
1 1 61591 8880 0.0691 0.0463 -0.0228
8 8 82741 11099 0.0928 0.0579 -0.0349
2 2 84570 10590 0.0949 0.0553 -0.0396
9 9 58878 4883 0.0661 0.0255 -0.0406
3 3 75084 5742 0.0842 0.0300 -0.0543

Describe people over- and underrepresented

In [296]:
pca.inverse_transform(model_10.cluster_centers_)
Out[296]:
array([[ -9.24191488e-01,  -1.16097958e+00,   5.39324601e-02,
          8.70070530e-01,  -8.81786417e-01,   6.09409115e-01,
          9.51211747e-01,  -6.64583242e-01,   2.77389281e-01,
         -4.70555505e-02,  -6.68973868e-01,   9.96761372e-01,
          1.29136949e+00,   1.22143261e+00,   7.90982597e-01,
          1.09428065e+00,  -5.81587940e-01,  -1.35857321e+00,
          1.40715317e+00,   2.27390682e-01,  -1.15929239e+00,
         -1.08331114e+00,  -1.27077483e+00,   7.59027574e-01,
          8.14425396e-01,   1.22459534e-02,  -2.57594490e-01,
          2.86605810e-01,  -2.53545202e-02,  -4.61070005e-01,
         -3.12206099e-01,  -1.91085205e-01,  -3.21427943e-01,
         -9.99714790e-02,   4.18542847e-01,   2.39956686e-01,
          4.27275069e-01,  -3.70771364e-02,  -2.88688531e-01,
         -3.74890454e-01,   4.76947260e-01,   3.13346478e-01,
         -5.15634664e-01,   4.16627045e-01,   2.80750031e-01,
         -8.62237843e-02,   4.42264201e-01,   7.10863521e-01,
         -1.23973044e-01,   2.40027202e-01,   5.62153953e-01,
         -3.83486211e-01,  -5.59046875e-01,  -5.53717225e-01,
         -5.46198607e-02,   4.57515692e-01,  -3.56194272e-01,
         -5.07270545e-01,  -3.46300376e-01,  -1.34879797e-01,
          7.72964755e-01,   2.77389281e-01,  -1.97087977e-01,
         -5.44795507e-01,   2.70537521e-01,  -3.17508861e-01,
          2.57139536e-01,  -2.43054907e-01,   3.84545160e-01,
         -4.04281099e-01],
       [  3.25672199e-01,   3.75769797e-02,  -3.61874014e-01,
         -1.84418735e-01,   1.90461686e-01,  -4.72801854e-01,
         -4.00823390e-01,   7.18594192e-01,  -2.17195024e-01,
          9.54998835e-02,   4.23178823e-01,   5.70861443e-02,
         -6.51874054e-02,  -2.77819616e-01,  -2.27553933e-01,
         -1.19484339e-01,   3.73250883e-01,   2.33676401e-01,
         -2.60635387e-01,  -2.71526796e-01,   1.91604635e-01,
          9.61831490e-02,   2.21792198e-01,  -2.88449936e-01,
         -3.72923224e-01,  -1.26938736e-02,   3.13988935e-01,
         -2.69128076e-01,   1.09892678e-01,   5.58318830e-01,
          3.73822537e-01,   1.11159615e-01,   8.84809730e-01,
          4.74223303e-01,  -1.24676639e+00,  -1.78635786e-01,
         -8.41610259e-01,  -2.60164179e-01,   1.94457072e-01,
          1.02866889e+00,  -9.06307347e-01,  -8.19346309e-01,
          1.03075383e+00,  -1.00877507e+00,  -1.39446042e+00,
         -2.27995927e-01,  -9.72397796e-01,  -5.16328632e-01,
          4.55488090e-02,  -5.70510999e-01,  -1.26583771e+00,
          4.01541589e-01,   1.05962850e+00,   1.29912866e+00,
          1.49616832e-01,  -1.11686737e+00,   5.38310788e-01,
          1.09429309e+00,   4.91890051e-01,   4.65749947e-01,
         -3.59287819e-01,  -2.17195024e-01,   1.25895987e-01,
          7.74688493e-01,  -6.12918370e-01,   2.47330910e+00,
         -2.35712044e+00,  -1.34899775e+00,  -5.77719812e-01,
          7.19572717e-01],
       [ -9.97763742e-01,   9.15998086e-01,  -5.36210777e-01,
          8.54984648e-01,  -8.34377008e-01,   1.08557801e+00,
          8.92578679e-01,  -2.04375111e-01,  -8.32115334e-02,
          1.69524870e-01,  -6.30745800e-01,  -6.18584329e-01,
         -1.91117836e-01,   6.49719505e-01,   5.39286365e-01,
         -8.93406894e-01,  -5.65968692e-01,  -1.22225416e-01,
         -7.87144103e-02,   1.31304797e+00,   5.21762599e-01,
          7.50644300e-01,   6.98999655e-01,   1.00912732e+00,
          1.03379651e+00,   1.82363679e-03,  -1.86450657e-01,
          1.97697311e-01,  -8.22986427e-03,  -2.72955889e-01,
         -4.55911598e-01,  -1.92861491e-01,  -3.25243956e-01,
         -1.17904573e-01,   4.27591661e-01,   2.68769642e-01,
          4.18112687e-01,   3.82061039e-02,  -3.71131947e-01,
         -3.73005121e-01,   5.09260448e-01,   3.77636110e-01,
         -5.65495314e-01,   4.67834066e-01,   2.57012091e-01,
         -1.02032764e-01,   4.53225366e-01,   5.26844734e-01,
         -1.26864276e-01,   2.63830700e-01,   6.04278083e-01,
         -4.09822262e-01,  -6.19147255e-01,  -6.20765001e-01,
         -8.15112674e-02,   4.93889515e-01,  -4.28017242e-01,
         -5.87602241e-01,  -4.28332802e-01,  -1.52202226e-01,
          4.54890467e-01,  -8.32115334e-02,  -1.50357284e-02,
         -5.16607963e-01,   2.40153203e-01,  -3.56173540e-01,
          2.80643372e-01,  -2.72502540e-01,   4.06186682e-01,
         -4.23219611e-01],
       [ -1.17501715e+00,   9.89533472e-01,  -1.39946171e+00,
          1.07435675e+00,  -8.89981366e-01,   9.33569956e-01,
          7.73331700e-01,   7.66983867e-01,  -4.32844063e-01,
          8.88396975e-02,  -5.12714885e-01,  -4.21563726e-01,
         -6.42476317e-02,   7.85787163e-01,   6.24982637e-01,
         -1.06520090e+00,  -7.16992756e-01,  -2.10788720e-01,
          3.89735025e-02,   1.44286157e+00,   5.58878350e-01,
          8.88173930e-01,   9.97890000e-01,   1.17627240e+00,
          1.10686565e+00,  -2.49648685e-03,  -1.16427142e-01,
         -1.92861236e-01,  -1.72235882e-02,   8.01303434e-01,
         -1.20380609e-01,  -3.79992364e-01,   3.46331152e-01,
         -2.94243827e-02,  -5.23692071e-01,  -2.14951820e-01,
         -6.94435959e-01,   1.26141604e-01,   5.14719834e-01,
          4.40812862e-01,  -7.02697365e-01,  -4.17021143e-01,
          6.95699463e-01,  -5.52450232e-01,  -2.56036491e-01,
          2.13921119e-01,  -7.21418178e-01,   1.38511154e-01,
          2.83233178e-01,  -2.52862189e-01,  -7.56733162e-01,
          6.16353036e-01,   7.98831511e-01,   7.25176863e-01,
          1.49060020e-01,  -5.39511278e-01,   5.00701751e-01,
          6.87476966e-01,   5.02151090e-01,   7.75003544e-02,
          5.29439845e-01,  -4.32844063e-01,   3.45759078e-01,
          7.22694158e-01,  -4.79732106e-01,   2.75427579e-01,
         -2.23109312e-01,   4.73514038e-01,  -5.35813190e-01,
          5.86968871e-01],
       [  4.64394317e-01,  -8.37042734e-01,   9.85776370e-01,
         -6.13705349e-01,   5.02136819e-01,  -3.97373171e-01,
         -4.96272802e-01,  -6.64342435e-01,  -4.67888798e-01,
          1.17210990e-01,   2.89392977e-01,   3.65888036e-01,
          3.09575132e-01,  -1.64534944e-01,  -1.84334659e-01,
          7.98540017e-01,   1.98901925e-01,  -1.56367882e-01,
          3.32359740e-01,  -7.30951587e-01,  -4.83738299e-01,
         -7.84275962e-01,  -8.88534154e-01,  -4.57618194e-01,
         -4.38434579e-01,   8.69539583e-03,   8.68909911e-02,
          1.17015926e-01,  -1.83273747e-02,  -3.72424602e-01,
          2.39911900e-02,   1.47723435e-01,  -3.65752667e-01,
         -1.16116324e-01,   6.32490715e-01,   2.61780571e-01,
          5.80695422e-01,  -1.16988035e-01,  -4.76850956e-01,
         -4.00910504e-01,   6.70008200e-01,   5.69288407e-01,
         -9.19542853e-01,   6.86170697e-01,   3.40938644e-01,
          1.54358009e-01,   6.48573494e-01,  -3.81797147e-02,
          5.50115440e-02,   1.42127670e-01,   7.03024433e-01,
         -6.11726808e-01,  -8.16473599e-01,  -7.45945577e-01,
         -3.38363830e-01,   4.05749794e-01,  -5.69303506e-01,
         -9.57103499e-01,  -5.70173573e-01,  -2.98613936e-02,
         -2.13245385e-01,  -4.67888798e-01,   5.18459001e-01,
         -5.10603694e-01,   2.74347037e-01,  -3.76216424e-01,
          3.10444805e-01,  -3.16897364e-01,   8.93900418e-01,
         -7.81437593e-01],
       [  7.87707798e-01,   8.75657008e-01,  -3.59059532e-01,
         -5.29169003e-01,   6.30198716e-01,  -3.72332658e-01,
         -6.01776358e-01,   7.97246712e-01,  -4.31923255e-01,
         -2.01831924e-01,   4.00853790e-01,  -7.16188925e-01,
         -9.93761163e-01,  -9.90636856e-01,  -6.45538373e-01,
         -7.88901756e-01,   4.87100777e-01,   1.07590021e+00,
         -1.11127403e+00,  -3.00986635e-01,   8.10418869e-01,
          7.24340550e-01,   9.46177182e-01,  -6.82218811e-01,
         -7.34023837e-01,  -3.99261475e-03,   1.40087672e-01,
         -3.03734294e-01,  -4.10668659e-02,   6.65607031e-01,
          3.18787881e-01,   8.10359101e-02,   2.99937595e-01,
          5.97569734e-02,  -2.58559389e-01,  -2.39825401e-01,
         -6.30845985e-01,   1.61570908e-01,   5.92589368e-01,
          3.18860332e-01,  -6.77663584e-01,  -2.16998396e-01,
          4.63980673e-01,  -2.91630037e-01,   3.68094913e-02,
          1.99733050e-01,  -6.23992395e-01,  -6.08296369e-01,
          1.74529901e-01,  -7.25195005e-02,  -4.76603640e-01,
          5.13407886e-01,   5.66928638e-01,   4.41536723e-01,
          1.53862788e-01,  -2.70724760e-01,   3.95562753e-01,
          4.45550313e-01,   4.09398380e-01,   3.56090957e-02,
         -5.73096365e-01,  -4.31923255e-01,   3.52123956e-01,
          5.52201940e-01,  -3.03702179e-02,  -2.79101377e-01,
          3.11296279e-01,   7.65880068e-01,  -4.46119148e-01,
          4.33520530e-01],
       [  4.96510223e-01,  -9.06914820e-01,   1.74826601e-01,
         -4.25931013e-01,   5.15899332e-01,  -7.00207273e-01,
         -5.53194126e-01,   2.45513169e-01,  -4.19880359e-01,
          1.79231842e-01,   7.00264305e-01,   5.52119464e-01,
          5.31160217e-01,  -6.39882375e-02,   3.78613911e-02,
          7.55957405e-01,   2.87077750e-01,  -1.84303011e-01,
          4.25032749e-01,  -7.70544149e-01,  -4.79358015e-01,
         -6.38355204e-01,  -6.82378193e-01,  -4.04540479e-01,
         -5.30391695e-01,  -9.89871547e-03,   2.83757093e-01,
         -2.61276847e-01,  -2.91400865e-02,   5.80432025e-01,
          3.62974031e-01,   1.08706555e-01,   2.69252969e-01,
          4.10893166e-02,  -2.97774349e-01,  -1.29809957e-01,
         -6.28640438e-01,   2.57790467e-01,   5.17046597e-01,
          2.41577468e-01,  -5.84756456e-01,  -2.15507474e-01,
          4.77248295e-01,  -3.22610503e-01,  -4.18069379e-02,
          2.21114598e-01,  -6.01318367e-01,  -5.92905754e-01,
          1.92934829e-01,  -8.66548371e-02,  -4.95500583e-01,
          5.43956687e-01,   5.63694891e-01,   4.16734397e-01,
          1.35970431e-01,  -2.73048705e-01,   4.18444416e-01,
          4.58604464e-01,   4.15966190e-01,   1.80402883e-02,
         -4.55498796e-01,  -4.19880359e-01,   3.58219696e-01,
          6.41027041e-01,  -2.87367715e-01,  -3.12132698e-01,
          3.21204088e-01,   8.05640664e-01,  -4.94852325e-01,
          5.46421787e-01],
       [  5.42369658e-01,  -2.03884642e-02,   1.01332701e+00,
         -7.06107077e-01,   5.44402284e-01,  -6.69632506e-01,
         -3.08843676e-01,  -6.80354141e-01,   1.85547829e+00,
         -1.64232541e-01,   1.78880223e-01,  -4.29542721e-02,
         -3.69368431e-01,  -5.57264914e-01,  -4.41443632e-01,
          1.94214087e-01,   3.92956878e-01,   3.15483456e-01,
         -3.72311326e-01,  -4.72092792e-01,  -5.43689492e-02,
          9.86249258e-02,  -1.69026758e-02,  -5.59454257e-01,
         -4.31865869e-01,   5.72575629e-03,  -4.51624334e-02,
          4.31422011e-01,   9.32992605e-02,  -1.29051446e+00,
         -1.48827764e-01,   3.02939660e-01,  -4.28486477e-01,
         -2.57655475e-02,   2.54439709e-01,  -2.81029343e-02,
          9.05262269e-01,  -1.23439730e-01,  -5.23109408e-01,
         -4.89403447e-01,   7.42543826e-01,  -5.08464838e-02,
          1.72637552e-02,   4.70493120e-02,   2.49468038e-01,
         -5.90507633e-01,   8.46213705e-01,   3.01151016e-01,
         -5.70632844e-01,   2.07459241e-01,   5.74533783e-01,
         -2.90820563e-01,  -4.38212798e-01,  -4.15114598e-01,
          1.01296873e-01,   4.86704006e-01,  -1.72182695e-01,
          5.53068414e-02,  -1.24767032e-01,  -2.86124116e-01,
         -2.82205358e-01,   1.85547829e+00,  -1.72763833e+00,
         -8.21921499e-01,   4.58935882e-01,  -3.22919575e-01,
          3.29626869e-01,  -1.78702807e-01,  -2.42340210e-01,
         -9.07281344e-02],
       [  6.94000302e-01,   7.98406594e-01,   4.45170391e-01,
         -5.72762608e-01,   4.27242855e-01,   2.46173165e-02,
         -4.45123150e-01,  -2.37093879e-01,  -4.55627931e-01,
         -2.46436746e-01,  -4.69741622e-02,  -8.11215121e-01,
         -1.04032695e+00,  -9.67571691e-01,  -6.58893414e-01,
         -6.06555762e-01,   3.44739857e-01,   9.82901713e-01,
         -1.02235084e+00,  -2.26382583e-01,   6.90632524e-01,
          4.77228570e-01,   6.07186556e-01,  -6.60950575e-01,
         -5.52950524e-01,  -1.23323220e-03,  -2.17292654e-01,
          5.69014357e-02,  -4.13490102e-02,  -2.40227511e-01,
         -9.83491383e-02,   1.50711691e-01,  -3.57442281e-01,
         -1.03014868e-01,   6.35931279e-01,   1.86371831e-01,
          5.54784942e-01,  -1.08083547e-01,  -4.00493704e-01,
         -3.93009079e-01,   6.20502595e-01,   5.95189153e-01,
         -9.48418003e-01,   7.22105331e-01,   3.69969869e-01,
          1.72096528e-01,   6.27531021e-01,  -4.68366971e-02,
          6.93159553e-02,   1.55051717e-01,   6.91287438e-01,
         -6.21212726e-01,  -7.98545135e-01,  -7.22759831e-01,
         -3.13787616e-01,   4.10292307e-01,  -5.52227331e-01,
         -9.69794508e-01,  -5.69987422e-01,   3.62107367e-02,
         -1.82763378e-01,  -4.55627931e-01,   5.12802846e-01,
         -4.69107111e-01,   2.79209399e-01,  -3.64811286e-01,
          3.18309989e-01,  -3.15719093e-01,   9.23604374e-01,
         -8.37411325e-01],
       [ -1.00038765e+00,  -1.17635026e+00,  -8.23629972e-01,
          1.11027574e+00,  -9.81022249e-01,   4.78254696e-01,
          7.84032980e-01,   3.12305921e-01,  -2.32562203e-01,
          1.07169869e-01,  -4.92618750e-01,   1.22633857e+00,
          1.42530740e+00,   1.25773020e+00,   7.46599206e-01,
          9.56032298e-01,  -6.99595233e-01,  -1.41047277e+00,
          1.49025031e+00,   1.84770595e-01,  -1.06980396e+00,
         -9.71235082e-01,  -1.04655474e+00,   8.02998266e-01,
          7.30659823e-01,  -6.74604815e-04,  -3.54685651e-03,
         -2.80106801e-01,  -3.69908178e-02,   7.18172516e-01,
          1.19846404e-01,  -4.24724882e-01,   4.44115619e-01,
          3.46649956e-02,  -5.78717069e-01,  -2.34798896e-01,
         -7.51461334e-01,   3.12937001e-02,   5.82389828e-01,
          5.26976856e-01,  -7.76579084e-01,  -4.42843780e-01,
          6.81994966e-01,  -5.77982816e-01,  -3.71316765e-01,
          2.01981791e-01,  -7.91624699e-01,   1.90216574e-01,
          2.39109310e-01,  -2.96394128e-01,  -8.29032953e-01,
          5.54086499e-01,   8.41744816e-01,   8.30462386e-01,
          1.58929470e-01,  -6.23356719e-01,   5.35861058e-01,
          7.19543197e-01,   4.93778692e-01,   2.21427414e-01,
          8.43110209e-01,  -2.32562203e-01,   3.11054743e-01,
          7.72677865e-01,  -5.67658156e-01,   4.64311693e-01,
         -3.89404032e-01,   3.78699478e-01,  -5.00905830e-01,
          6.03473009e-01]])
In [297]:
scaler.inverse_transform(pca.inverse_transform(model_10.cluster_centers_))
Out[297]:
array([[  2.81165489e+00,   7.84509948e-01,   4.31865683e+00,
          2.15739332e+00,   3.81688509e+00,   2.21323311e+00,
          2.79685293e+00,   1.85316216e+00,   6.35209181e-01,
          1.94269525e+00,   3.14979547e+00,   5.94133704e+00,
          6.06814709e+00,   4.97132294e+00,   5.05599694e+00,
          7.12137286e+00,   4.56546838e+00,   2.49044499e+00,
          6.37955225e+00,   3.20792410e+00,   1.85268793e+00,
          2.27595171e+00,   1.31904404e+00,   4.11791614e+00,
          4.04074705e+00,   1.10519563e-02,   1.37277333e+00,
          2.66781991e+00,   1.64966401e-02,   2.49126221e+00,
          3.78074080e+00,   8.42702877e+00,   3.85368991e-01,
          1.22484867e-02,   3.74130990e+00,   1.99379578e+03,
          2.72179247e+00,   1.11693538e+00,   7.98129218e-02,
         -2.85110886e-02,   4.15214065e+00,   4.96471934e+00,
          3.05316938e+00,   5.60173380e+00,   4.08562583e+00,
          2.38043772e+00,   4.11783442e+00,   3.95786593e+00,
          3.88448896e+00,   7.57345193e+02,   3.03201057e+00,
          2.41732223e+00,   9.08920217e-01,   1.77434032e-01,
          3.57146748e+00,   4.06822281e+00,   2.46878550e+00,
          4.02588503e+00,   2.40694272e+00,   4.34359421e-02,
          6.95751930e+01,   6.35209181e-01,   3.95729556e-01,
          1.84084651e+00,   3.72197353e+00,  -1.01865162e-02,
          9.82462414e-01,   1.00313695e+00,   4.00308021e-01,
          1.44842298e+00],
       [  3.72394137e+00,   1.34839684e+00,   3.89533990e+00,
          1.27752678e+00,   4.72427590e+00,   1.12825353e+00,
          1.47371684e+00,   3.67440913e+00,   3.87923026e-01,
          2.05356222e+00,   4.44201515e+00,   4.41994744e+00,
          3.78459046e+00,   2.78259016e+00,   3.03697601e+00,
          4.98385059e+00,   6.12638911e+00,   5.28011565e+00,
          3.45647599e+00,   2.50872252e+00,   4.25608859e+00,
          4.15358653e+00,   3.99603033e+00,   2.61828992e+00,
          2.38728855e+00,   8.58983558e-03,   1.65856364e+00,
          1.89455522e+00,   3.70851352e-02,   4.17825161e+00,
          5.06598426e+00,   8.77538669e+00,   1.76396785e+01,
          3.23128894e-01,   1.34244350e+00,   1.99250642e+03,
          9.13246037e-01,   8.73658070e-01,   4.42841998e-01,
          6.27965366e-01,   2.56101967e+00,   2.56961607e+00,
          5.53937329e+00,   2.80616783e+00,   2.69472946e+00,
          2.24112049e+00,   2.19022818e+00,   2.29845334e+00,
          4.21593599e+00,   4.76683186e+02,   1.38868673e+00,
          3.07149311e+00,   2.33384637e+00,   1.35427857e+00,
          3.76110792e+00,   2.49050199e+00,   3.37235833e+00,
          7.47725926e+00,   3.59533321e+00,   2.06415507e-01,
          5.31144105e+01,   3.87923026e-01,   5.57210922e-01,
          3.70223009e+00,   2.53239351e+00,   7.10089908e-01,
          2.27699557e-01,   1.68022831e-01,  -8.78344045e-03,
          3.11957404e+00],
       [  2.75795386e+00,   1.76166908e+00,   3.71785421e+00,
          2.14480565e+00,   3.85700534e+00,   2.69062017e+00,
          2.73947311e+00,   2.45912394e+00,   4.54913147e-01,
          2.11113228e+00,   3.19502638e+00,   3.32599711e+00,
          3.57260579e+00,   4.13668872e+00,   4.55706544e+00,
          3.62092061e+00,   4.59100192e+00,   4.65656464e+00,
          3.77532270e+00,   4.72940476e+00,   4.84347745e+00,
          5.19542205e+00,   4.85192344e+00,   4.47597264e+00,
          4.34623529e+00,   1.00230397e-02,   1.40834507e+00,
          2.54410998e+00,   1.91035029e-02,   2.80257278e+00,
          3.51151509e+00,   8.42498148e+00,   3.30783842e-01,
          2.53915535e-03,   3.75434466e+00,   1.99388453e+03,
          2.70873330e+00,   1.19903205e+00,   1.78660528e-02,
         -2.76292758e-02,   4.18930965e+00,   5.10066115e+00,
          2.97300599e+00,   5.70216342e+00,   4.06591666e+00,
          2.36490248e+00,   4.13277001e+00,   3.70903522e+00,
          3.87883606e+00,   7.65587541e+02,   3.06987923e+00,
          2.39537616e+00,   8.56013622e-01,   1.34848294e-01,
          3.54649793e+00,   4.10467375e+00,   2.39623445e+00,
          3.85276999e+00,   2.29063675e+00,   3.87355389e-02,
          6.49510024e+01,   4.54913147e-01,   4.86749696e-01,
          1.88061041e+00,   3.68106081e+00,  -2.01654032e-02,
          9.89248206e-01,   9.80900594e-01,   4.09508566e-01,
          1.42026174e+00],
       [  2.62857504e+00,   1.79626540e+00,   2.83901091e+00,
          2.32784987e+00,   3.80995011e+00,   2.53822324e+00,
          2.62277497e+00,   3.73812442e+00,   2.80101121e-01,
          2.04838252e+00,   3.33467890e+00,   3.64498508e+00,
          3.78617243e+00,   4.33533158e+00,   4.72693914e+00,
          3.31837975e+00,   4.34411565e+00,   4.50139899e+00,
          3.98159030e+00,   4.91133039e+00,   4.90951063e+00,
          5.41435521e+00,   5.38799676e+00,   4.71526671e+00,
          4.44798871e+00,   9.59654654e-03,   1.44335666e+00,
          2.00067518e+00,   1.77344007e-02,   4.58036750e+00,
          4.14011714e+00,   8.20930030e+00,   9.93715335e+00,
          5.04440795e-02,   2.38402704e+00,   1.99239456e+03,
          1.12301405e+00,   1.29492606e+00,   6.83483108e-01,
          3.53011804e-01,   2.79522682e+00,   3.42034129e+00,
          5.00069010e+00,   3.70113366e+00,   3.63994187e+00,
          2.67538534e+00,   2.53221084e+00,   3.18392949e+00,
          4.68065352e+00,   5.86674242e+02,   1.84636039e+00,
          3.25049746e+00,   2.10426578e+00,   9.89730150e-01,
          3.76059090e+00,   3.06908210e+00,   3.33436804e+00,
          6.60056934e+00,   3.60988136e+00,   1.01064831e-01,
          6.60348075e+01,   2.80101121e-01,   6.67135241e-01,
          3.62888217e+00,   2.71172982e+00,   1.42843221e-01,
          8.43809795e-01,   1.54422896e+00,   9.03248552e-03,
          2.92239443e+00],
       [  3.82519587e+00,   9.36913054e-01,   5.26733205e+00,
          9.19329742e-01,   4.98803123e+00,   1.20387519e+00,
          1.38030760e+00,   1.85347924e+00,   2.62579179e-01,
          2.07044719e+00,   4.28372162e+00,   4.91991591e+00,
          4.41544603e+00,   2.94797252e+00,   3.12264856e+00,
          6.60055359e+00,   5.84137252e+00,   4.59674598e+00,
          4.49579838e+00,   1.86486746e+00,   3.05457609e+00,
          2.75198526e+00,   2.00460993e+00,   2.37609937e+00,
          2.29605982e+00,   1.07014366e-02,   1.54501522e+00,
          2.43184749e+00,   1.75663728e-02,   2.63796173e+00,
          4.41059114e+00,   8.81752900e+00,  -2.48662370e-01,
          3.50734852e-03,   4.04950032e+00,   1.99386300e+03,
          2.94046300e+00,   1.02979222e+00,  -6.15697814e-02,
         -4.06812546e-02,   4.37421351e+00,   5.50591406e+00,
          2.40378656e+00,   6.13037554e+00,   4.13559933e+00,
          2.61685362e+00,   4.39894968e+00,   2.94500761e+00,
          4.23443741e+00,   7.23445891e+02,   3.15865000e+00,
          2.22712725e+00,   6.82306482e-01,   5.53392116e-02,
          3.30800194e+00,   4.01634717e+00,   2.25351594e+00,
          3.05649372e+00,   2.08953421e+00,   7.19324558e-02,
          5.52375876e+01,   2.62579179e-01,   7.53479526e-01,
          1.88908058e+00,   3.72710307e+00,  -2.53382296e-02,
          9.97852176e-01,   9.47377393e-01,   6.16852208e-01,
          8.87597842e-01],
       [  4.06118521e+00,   1.74268975e+00,   3.89820522e+00,
          9.89866925e-01,   5.09640372e+00,   1.22897976e+00,
          1.27705913e+00,   3.77797188e+00,   2.80561514e-01,
          1.82232391e+00,   4.41560049e+00,   3.16796951e+00,
          2.22147817e+00,   1.74196041e+00,   2.20841532e+00,
          3.80496122e+00,   6.31250498e+00,   6.75571786e+00,
          1.96559057e+00,   2.46743640e+00,   5.35703004e+00,
          5.15354915e+00,   5.29524749e+00,   2.05454882e+00,
          1.88443297e+00,   9.44884513e-03,   1.57161343e+00,
          1.84640310e+00,   1.41047713e-02,   4.35580315e+00,
          4.96287933e+00,   8.74066705e+00,   9.27352891e+00,
          9.87286289e-02,   2.76594883e+00,   1.99231794e+03,
          1.21364905e+00,   1.33356186e+00,   7.41993224e-01,
          2.95971847e-01,   2.82402251e+00,   3.84329369e+00,
          4.62814455e+00,   4.21266655e+00,   3.88308650e+00,
          2.66144295e+00,   2.66496237e+00,   2.17409431e+00,
          4.46811822e+00,   6.49120825e+02,   2.09819060e+00,
          3.16471232e+00,   1.90012079e+00,   8.09574665e-01,
          3.76505043e+00,   3.33843846e+00,   3.22816323e+00,
          6.07921671e+00,   3.47837611e+00,   8.96977297e-02,
          5.00060440e+01,   2.80561514e-01,   6.70317471e-01,
          3.38837042e+00,   3.31679899e+00,  -2.74008278e-04,
          9.98098005e-01,   1.76499896e+00,   4.71644631e-02,
          2.69421942e+00],
       [  3.84863755e+00,   9.04040217e-01,   4.44173460e+00,
          1.07600877e+00,   4.99967777e+00,   9.00266348e-01,
          1.32460294e+00,   3.05149745e+00,   2.86582816e-01,
          2.11868151e+00,   4.76985879e+00,   5.22143558e+00,
          4.78845073e+00,   3.09475894e+00,   3.56310254e+00,
          6.52556273e+00,   5.98551777e+00,   4.54780277e+00,
          4.65822321e+00,   1.80938097e+00,   3.06236912e+00,
          2.98427626e+00,   2.37435994e+00,   2.45208834e+00,
          2.16800368e+00,   8.86578043e-03,   1.64344779e+00,
          1.90547966e+00,   1.59203685e-02,   4.21484680e+00,
          5.04566008e+00,   8.77255937e+00,   8.83460875e+00,
          8.86215905e-02,   2.70945995e+00,   1.99265681e+03,
          1.21679262e+00,   1.43848967e+00,   6.85231406e-01,
          2.59824903e-01,   2.93089111e+00,   3.84644628e+00,
          4.64947555e+00,   4.15190620e+00,   3.81781272e+00,
          2.68245426e+00,   2.69585781e+00,   2.19490554e+00,
          4.50410330e+00,   6.44226235e+02,   2.08120267e+00,
          3.19016892e+00,   1.89727411e+00,   7.93821341e-01,
          3.74843679e+00,   3.33610959e+00,   3.25127685e+00,
          6.10734844e+00,   3.48768799e+00,   8.49304723e-02,
          5.17156872e+01,   2.86582816e-01,   6.73365141e-01,
          3.51367514e+00,   2.97074992e+00,  -8.79899368e-03,
          1.00095849e+00,   1.79502278e+00,   2.64463375e-02,
          2.86210162e+00],
       [  3.88211075e+00,   1.32112575e+00,   5.29538032e+00,
          8.42229679e-01,   5.02379842e+00,   9.30919334e-01,
          1.56373053e+00,   1.83239642e+00,   1.42423455e+00,
          1.85156538e+00,   4.15296449e+00,   4.25797611e+00,
          3.27254810e+00,   2.37463272e+00,   2.61298751e+00,
          5.53629474e+00,   6.15860345e+00,   5.42344415e+00,
          3.26074537e+00,   2.22764184e+00,   3.81847338e+00,
          4.15747359e+00,   3.56792031e+00,   2.23030528e+00,
          2.30520717e+00,   1.04082665e-02,   1.47898884e+00,
          2.86932149e+00,   3.45591421e-02,   1.11861229e+00,
          4.08682269e+00,   8.99642636e+00,  -1.14602165e+00,
          5.24250452e-02,   3.50492047e+00,   1.99297010e+03,
          3.40306920e+00,   1.02275663e+00,  -9.63277573e-02,
         -8.20714060e-02,   4.45764932e+00,   4.19462549e+00,
          3.90993606e+00,   4.87690049e+00,   4.05965297e+00,
          1.88488600e+00,   4.66825253e+00,   3.40385156e+00,
          3.01118526e+00,   7.46068008e+02,   3.04313977e+00,
          2.49454137e+00,   1.01529092e+00,   2.65468199e-01,
          3.71624126e+00,   4.09747301e+00,   2.65466244e+00,
          5.23824091e+00,   2.72103378e+00,   2.39612739e-03,
          5.42350417e+01,   1.42423455e+00,  -3.69495315e-01,
          1.44990750e+00,   3.97565334e+00,  -1.15829563e-02,
          1.00339023e+00,   1.05173019e+00,   1.33797798e-01,
          1.91467078e+00],
       [  3.99278729e+00,   1.70634562e+00,   4.71696144e+00,
          9.53492392e-01,   4.92465216e+00,   1.62694524e+00,
          1.43036396e+00,   2.41604273e+00,   2.68709464e-01,
          1.78763424e+00,   3.88573676e+00,   3.01411650e+00,
          2.14309177e+00,   1.77563285e+00,   2.18194193e+00,
          4.12608486e+00,   6.07978077e+00,   6.59278161e+00,
          2.12144322e+00,   2.57198930e+00,   5.14391641e+00,
          4.76017204e+00,   4.68725250e+00,   2.08499759e+00,
          2.13658908e+00,   9.72125812e-03,   1.39292415e+00,
          2.34820240e+00,   1.40618209e-02,   2.85673506e+00,
          4.18139210e+00,   8.82097317e+00,  -1.29788637e-01,
          1.06007356e-02,   4.05445643e+00,   1.99363073e+03,
          2.90353271e+00,   1.03950260e+00,  -4.19597406e-03,
         -3.69855794e-02,   4.31726853e+00,   5.56068175e+00,
          2.35736258e+00,   6.20085224e+00,   4.15970342e+00,
          2.63428498e+00,   4.37027739e+00,   2.93330162e+00,
          4.26240522e+00,   7.27921052e+02,   3.14809871e+00,
          2.21922254e+00,   6.98088978e-01,   7.00657567e-02,
          3.33082176e+00,   4.02089931e+00,   2.27076521e+00,
          3.02914456e+00,   2.08979814e+00,   8.98609837e-02,
          5.56807375e+01,   2.68709464e-01,   7.50651635e-01,
          1.94761941e+00,   3.73365028e+00,  -2.23947011e-02,
          1.00012293e+00,   9.48267123e-01,   6.29480367e-01,
          8.04365877e-01],
       [  2.75603864e+00,   7.77278480e-01,   3.42524354e+00,
          2.35782070e+00,   3.73290669e+00,   2.08174316e+00,
          2.63324752e+00,   3.13944430e+00,   3.80239623e-01,
          2.06263810e+00,   3.35845636e+00,   6.31303606e+00,
          6.29361113e+00,   5.02431318e+00,   4.96801679e+00,
          6.87790816e+00,   4.37255621e+00,   2.39951536e+00,
          6.52519378e+00,   3.14819472e+00,   2.01189810e+00,
          2.45436538e+00,   1.72119288e+00,   4.18086699e+00,
          3.92409816e+00,   9.77640718e-03,   1.49979652e+00,
          1.87927910e+00,   1.47252625e-02,   4.44279392e+00,
          4.59017152e+00,   8.15774300e+00,   1.13358855e+01,
          8.51433360e-02,   2.30476391e+00,   1.99233342e+03,
          1.04173560e+00,   1.19149405e+00,   7.34329428e-01,
          3.93312653e-01,   2.71024264e+00,   3.36573877e+00,
          4.97865672e+00,   3.65105795e+00,   3.54422678e+00,
          2.66365276e+00,   2.43654804e+00,   3.25384570e+00,
          4.59438308e+00,   5.71600601e+02,   1.78136449e+00,
          3.19861018e+00,   2.14204253e+00,   1.05660279e+00,
          3.76975501e+00,   2.98505891e+00,   3.36988376e+00,
          6.66967216e+00,   3.59801093e+00,   1.40119123e-01,
          7.05949735e+01,   3.80239623e-01,   6.49784215e-01,
          3.69939371e+00,   2.59333673e+00,   1.91591930e-01,
          7.95798855e-01,   1.47263306e+00,   2.38727875e-02,
          2.94693579e+00]])
In [298]:
# What kinds of people are part of a cluster that is overrepresented  and underrepresented in the
# customer data compared to the general population?

# adapted from:
# https://github.com/chauhan-nitin/Udacity-IdentifyCustomerSegments-Arvato/blob/master/Identify_Customer_Segments.ipynb

# Cluster center specs can be found below:
cluster_centroids = pd.DataFrame(scaler.inverse_transform(pca.inverse_transform(model_10.cluster_centers_)), columns=columns_list_customers)
cluster_centroids

# look at cluster centroids for 4 (underrepresented) and 3 (overrepresented)

# I exported to Excel, compared Excel vs data dictionary
Out[298]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP RETOURTYP_BK_S SEMIO_SOZ SEMIO_FAM SEMIO_REL SEMIO_MAT SEMIO_VERT SEMIO_LUST SEMIO_ERL SEMIO_KULT SEMIO_RAT SEMIO_KRIT SEMIO_DOM SEMIO_KAEM SEMIO_PFLICHT SEMIO_TRADV SOHO_KZ VERS_TYP ANZ_PERSONEN ANZ_TITEL HH_EINKOMMEN_SCORE W_KEIT_KIND_HH WOHNDAUER_2008 ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL KONSUMNAEHE MIN_GEBAEUDEJAHR KBA05_ANTG1 KBA05_ANTG2 KBA05_ANTG3 KBA05_ANTG4 KBA05_GBZ BALLRAUM EWDICHTE INNENSTADT GEBAEUDETYP_RASTER KKK MOBI_REGIO ONLINE_AFFINITAET REGIOTYP KBA13_ANZAHL_PKW PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB OST_WEST_KZ_O PRAEGENDE_JUGENDJAHRE_DECADE PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM CAMEO_INTL_2015_WEALTH CAMEO_INTL_2015_LIFE_STAGE_TYP PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS PLZ8_BAUMAX_BLDNG_TYPE_FAMILY PLZ8_BAUMAX_FAMILY_HOMES WOHNLAGE_RURAL_FLAG WOHNLAGE_CITY_NEIGHBOURHOOD
0 2.8117 0.7845 4.3187 2.1574 3.8169 2.2132 2.7969 1.8532 0.6352 1.9427 3.1498 5.9413 6.0681 4.9713 5.0560 7.1214 4.5655 2.4904 6.3796 3.2079 1.8527 2.2760 1.3190 4.1179 4.0407 0.0111 1.3728 2.6678 0.0165 2.4913 3.7807 8.4270 0.3854 0.0122 3.7413 1993.7958 2.7218 1.1169 0.0798 -0.0285 4.1521 4.9647 3.0532 5.6017 4.0856 2.3804 4.1178 3.9579 3.8845 757.3452 3.0320 2.4173 0.9089 0.1774 3.5715 4.0682 2.4688 4.0259 2.4069 0.0434 69.5752 0.6352 0.3957 1.8408 3.7220 -0.0102 0.9825 1.0031 0.4003 1.4484
1 3.7239 1.3484 3.8953 1.2775 4.7243 1.1283 1.4737 3.6744 0.3879 2.0536 4.4420 4.4199 3.7846 2.7826 3.0370 4.9839 6.1264 5.2801 3.4565 2.5087 4.2561 4.1536 3.9960 2.6183 2.3873 0.0086 1.6586 1.8946 0.0371 4.1783 5.0660 8.7754 17.6397 0.3231 1.3424 1992.5064 0.9132 0.8737 0.4428 0.6280 2.5610 2.5696 5.5394 2.8062 2.6947 2.2411 2.1902 2.2985 4.2159 476.6832 1.3887 3.0715 2.3338 1.3543 3.7611 2.4905 3.3724 7.4773 3.5953 0.2064 53.1144 0.3879 0.5572 3.7022 2.5324 0.7101 0.2277 0.1680 -0.0088 3.1196
2 2.7580 1.7617 3.7179 2.1448 3.8570 2.6906 2.7395 2.4591 0.4549 2.1111 3.1950 3.3260 3.5726 4.1367 4.5571 3.6209 4.5910 4.6566 3.7753 4.7294 4.8435 5.1954 4.8519 4.4760 4.3462 0.0100 1.4083 2.5441 0.0191 2.8026 3.5115 8.4250 0.3308 0.0025 3.7543 1993.8845 2.7087 1.1990 0.0179 -0.0276 4.1893 5.1007 2.9730 5.7022 4.0659 2.3649 4.1328 3.7090 3.8788 765.5875 3.0699 2.3954 0.8560 0.1348 3.5465 4.1047 2.3962 3.8528 2.2906 0.0387 64.9510 0.4549 0.4867 1.8806 3.6811 -0.0202 0.9892 0.9809 0.4095 1.4203
3 2.6286 1.7963 2.8390 2.3278 3.8100 2.5382 2.6228 3.7381 0.2801 2.0484 3.3347 3.6450 3.7862 4.3353 4.7269 3.3184 4.3441 4.5014 3.9816 4.9113 4.9095 5.4144 5.3880 4.7153 4.4480 0.0096 1.4434 2.0007 0.0177 4.5804 4.1401 8.2093 9.9372 0.0504 2.3840 1992.3946 1.1230 1.2949 0.6835 0.3530 2.7952 3.4203 5.0007 3.7011 3.6399 2.6754 2.5322 3.1839 4.6807 586.6742 1.8464 3.2505 2.1043 0.9897 3.7606 3.0691 3.3344 6.6006 3.6099 0.1011 66.0348 0.2801 0.6671 3.6289 2.7117 0.1428 0.8438 1.5442 0.0090 2.9224
4 3.8252 0.9369 5.2673 0.9193 4.9880 1.2039 1.3803 1.8535 0.2626 2.0704 4.2837 4.9199 4.4154 2.9480 3.1226 6.6006 5.8414 4.5967 4.4958 1.8649 3.0546 2.7520 2.0046 2.3761 2.2961 0.0107 1.5450 2.4318 0.0176 2.6380 4.4106 8.8175 -0.2487 0.0035 4.0495 1993.8630 2.9405 1.0298 -0.0616 -0.0407 4.3742 5.5059 2.4038 6.1304 4.1356 2.6169 4.3989 2.9450 4.2344 723.4459 3.1587 2.2271 0.6823 0.0553 3.3080 4.0163 2.2535 3.0565 2.0895 0.0719 55.2376 0.2626 0.7535 1.8891 3.7271 -0.0253 0.9979 0.9474 0.6169 0.8876
5 4.0612 1.7427 3.8982 0.9899 5.0964 1.2290 1.2771 3.7780 0.2806 1.8223 4.4156 3.1680 2.2215 1.7420 2.2084 3.8050 6.3125 6.7557 1.9656 2.4674 5.3570 5.1535 5.2952 2.0545 1.8844 0.0094 1.5716 1.8464 0.0141 4.3558 4.9629 8.7407 9.2735 0.0987 2.7659 1992.3179 1.2136 1.3336 0.7420 0.2960 2.8240 3.8433 4.6281 4.2127 3.8831 2.6614 2.6650 2.1741 4.4681 649.1208 2.0982 3.1647 1.9001 0.8096 3.7651 3.3384 3.2282 6.0792 3.4784 0.0897 50.0060 0.2806 0.6703 3.3884 3.3168 -0.0003 0.9981 1.7650 0.0472 2.6942
6 3.8486 0.9040 4.4417 1.0760 4.9997 0.9003 1.3246 3.0515 0.2866 2.1187 4.7699 5.2214 4.7885 3.0948 3.5631 6.5256 5.9855 4.5478 4.6582 1.8094 3.0624 2.9843 2.3744 2.4521 2.1680 0.0089 1.6434 1.9055 0.0159 4.2148 5.0457 8.7726 8.8346 0.0886 2.7095 1992.6568 1.2168 1.4385 0.6852 0.2598 2.9309 3.8464 4.6495 4.1519 3.8178 2.6825 2.6959 2.1949 4.5041 644.2262 2.0812 3.1902 1.8973 0.7938 3.7484 3.3361 3.2513 6.1073 3.4877 0.0849 51.7157 0.2866 0.6734 3.5137 2.9707 -0.0088 1.0010 1.7950 0.0264 2.8621
7 3.8821 1.3211 5.2954 0.8422 5.0238 0.9309 1.5637 1.8324 1.4242 1.8516 4.1530 4.2580 3.2725 2.3746 2.6130 5.5363 6.1586 5.4234 3.2607 2.2276 3.8185 4.1575 3.5679 2.2303 2.3052 0.0104 1.4790 2.8693 0.0346 1.1186 4.0868 8.9964 -1.1460 0.0524 3.5049 1992.9701 3.4031 1.0228 -0.0963 -0.0821 4.4576 4.1946 3.9099 4.8769 4.0597 1.8849 4.6683 3.4039 3.0112 746.0680 3.0431 2.4945 1.0153 0.2655 3.7162 4.0975 2.6547 5.2382 2.7210 0.0024 54.2350 1.4242 -0.3695 1.4499 3.9757 -0.0116 1.0034 1.0517 0.1338 1.9147
8 3.9928 1.7063 4.7170 0.9535 4.9247 1.6269 1.4304 2.4160 0.2687 1.7876 3.8857 3.0141 2.1431 1.7756 2.1819 4.1261 6.0798 6.5928 2.1214 2.5720 5.1439 4.7602 4.6873 2.0850 2.1366 0.0097 1.3929 2.3482 0.0141 2.8567 4.1814 8.8210 -0.1298 0.0106 4.0545 1993.6307 2.9035 1.0395 -0.0042 -0.0370 4.3173 5.5607 2.3574 6.2009 4.1597 2.6343 4.3703 2.9333 4.2624 727.9211 3.1481 2.2192 0.6981 0.0701 3.3308 4.0209 2.2708 3.0291 2.0898 0.0899 55.6807 0.2687 0.7507 1.9476 3.7337 -0.0224 1.0001 0.9483 0.6295 0.8044
9 2.7560 0.7773 3.4252 2.3578 3.7329 2.0817 2.6332 3.1394 0.3802 2.0626 3.3585 6.3130 6.2936 5.0243 4.9680 6.8779 4.3726 2.3995 6.5252 3.1482 2.0119 2.4544 1.7212 4.1809 3.9241 0.0098 1.4998 1.8793 0.0147 4.4428 4.5902 8.1577 11.3359 0.0851 2.3048 1992.3334 1.0417 1.1915 0.7343 0.3933 2.7102 3.3657 4.9787 3.6511 3.5442 2.6637 2.4365 3.2538 4.5944 571.6006 1.7814 3.1986 2.1420 1.0566 3.7698 2.9851 3.3699 6.6697 3.5980 0.1401 70.5950 0.3802 0.6498 3.6994 2.5933 0.1916 0.7958 1.4726 0.0239 2.9469
In [299]:
# cluster_centroids.to_excel('customer_cluster_centroids.xlsx')

Discussion 3.3: Compare Customer Data to Demographics Data

(Double-click this cell and replace this text with your own text, reporting findings and conclusions from the clustering analysis. Can we describe segments of the population that are relatively popular with the mail-order company, or relatively unpopular with the company?)

Some things I'd do differently if I were to do this analysis again

I regret dropping some variables from the analysis, such as LP_STATUS_GROB (social status), SHOPPER_TYP (shopper type), and some other mixed type variables indicating social status and shopping habits. Super important.

These are multilevel vars. I should have just recoded them as dummies. This is very important info for marketing, even critical.

I think even 10 clusters is overkill, hard to analyze and describe. Maybe 7 would have been better for marketing?

The way I was creating calcualated variables was inefficient. I was creating temp vars and then dropping them. I could take a look at how other people were doing this.

I didn't need to run K means with 2,3,4...30 clusters on all data. 25% random sample/subset is sufficient for that.

Target audience

Cluster 4

Some of these are repetitive, or may slightly differ: middle class vs upper middle class etc, because we have several vars describing the same thing. It's good to check if the cluster makes sense and the data is good, to make sure there are no big contradictions

  • male
  • money saver
  • investor
  • home owner
  • likely to be >60 years old
  • shopper type: conservative Low-Returner
  • not too social
  • not dreamful
  • religious
  • average to low level of being cultured
  • critically-minded
  • dominant
  • very rational
  • with a combative attitude
  • dutiful
  • traditinal
  • very high or high income
  • average or unlikely to have kids in household
  • likely to have lived in the same place for 10 or more years -(?? If I'm intepreting WOHNDAUER_2008 var correctly. Also, is this old data?)
  • live 20 - 30 km to city center
  • low movement pattern
  • high online affinity (this is surprising because I thought these were older people? Also, I'm not sure what this actually means. If they use internet?)
  • midle class neighbourhood (interesting, I thought it was high or very high income. Why not a higher class neighbourhood)
  • upper middle class
  • very good neighbourhood

We could also describe cluster 7 and 6

Not target audience

Cluster 3

  • female
  • hard to tell age. In between these 2 groups: 30-45 adn 46-60
  • saver, but less so than top cluster in target audience
  • not a home owner
  • average to low level of religiousness
  • not materialistic
  • dreamful
  • average level of being cultured
  • not critically minded
  • not dominant
  • not a combative attitude
  • not dutiful
  • not traditional
  • average neighbourhood
  • both of these classes: established middleclass and consumption-oriented middleclass
  • 5 - 10 km to city center (or less)
  • middle to high movement pattern, in between
  • online affinity is high, higher than that of the top cluster in the target audience
  • moddle class to lower middle class neighbourhood

We can also describe clustesrs 9 and 2.

Congratulations on making it this far in the project! Before you finish, make sure to check through the entire notebook from top to bottom to make sure that your analysis follows a logical flow and all of your findings are documented in Discussion cells. Once you've checked over all of your work, you should export the notebook as an HTML document to submit for evaluation. You can do this from the menu, navigating to File -> Download as -> HTML (.html). You will submit both that document and this notebook for your project submission.

In [ ]: